review.tizen.org Git - platform/kernel/linux-starfive.git/log

x86: link vdso and boot with -z noexecstack --no-warn-rwx-segments

Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:

  ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
  ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions

Generally, we would like to avoid the stack being executable.  Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack.  Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.

LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO.  --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.

While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.

Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Makefile: link with -z noexecstack --no-warn-rwx-segments

Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:

  ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
  ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  ld: warning: vmlinux has a LOAD segment with RWX permissions

Generally, we would like to avoid the stack being executable.  Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack.  Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.

LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO.  --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.

While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.

Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

crypto: blake2b: effectively disable frame size warning

It turns out that gcc-12.1 has some nasty problems with register
allocation on a 32-bit x86 build for the 64-bit values used in the
generic blake2b implementation, where the pattern of 64-bit rotates and
xor operations ends up making gcc generate horrible code.

As a result it ends up with a ridiculously large stack frame for all the
spills it generates, resulting in the following build problem:

crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’:
crypto/blake2b_generic.c:109:1: error: the frame size of 2640 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

on the same test-case, clang ends up generating a stack frame that is
just 296 bytes (and older gcc versions generate a slightly bigger one at
428 bytes - still nowhere near that almost 3kB monster stack frame of
gcc-12.1).

The issue is fixed both in mainline and the GCC 12 release branch [1],
but current release compilers end up failing the i386 allmodconfig build
due to this issue.

Disable the warning for now by simply raising the frame size for this
one file, just to keep this issue from having people turn off WERROR.

Link: https://lore.kernel.org/all/CAHk-=wjxqgeG2op+=W9sqgsWqCYnavC+SRfVyopu9-31S6xw+Q@mail.gmail.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
"Highlights include:

  Stable fixes:
   - pNFS/flexfiles: Fix infinite looping when the RDMA connection
     errors out

  Bugfixes:
   - NFS: fix port value parsing
   - SUNRPC: Reinitialise the backchannel request buffers before reuse
   - SUNRPC: fix expiry of auth creds
   - NFSv4: Fix races in the legacy idmapper upcall
   - NFS: O_DIRECT fixes from Jeff Layton
   - NFSv4.1: Fix OP_SEQUENCE error handling
   - SUNRPC: Fix an RPC/RDMA performance regression
   - NFS: Fix case insensitive renames
   - NFSv4/pnfs: Fix a use-after-free bug in open
   - NFSv4.1: RECLAIM_COMPLETE must handle EACCES

  Features:
   - NFSv4.1: session trunking enhancements
   - NFSv4.2: READ_PLUS performance optimisations
   - NFS: relax the rules for rsize/wsize mount options
   - NFS: don't unhash dentry during unlink/rename
   - SUNRPC: Fail faster on bad verifier
   - NFS/SUNRPC: Various tracing improvements"

* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
  NFS: Improve readpage/writepage tracing
  NFS: Improve O_DIRECT tracing
  NFS: Improve write error tracing
  NFS: don't unhash dentry during unlink/rename
  NFSv4/pnfs: Fix a use-after-free bug in open
  NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
  SUNRPC: Don't reuse bvec on retransmission of the request
  SUNRPC: Reinitialise the backchannel request buffers before reuse
  NFSv4.1: RECLAIM_COMPLETE must handle EACCES
  NFSv4.1 probe offline transports for trunking on session creation
  SUNRPC create a function that probes only offline transports
  SUNRPC export xprt_iter_rewind function
  SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
  NFSv4.1 remove xprt from xprt_switch if session trunking test fails
  SUNRPC create an rpc function that allows xprt removal from rpc_clnt
  SUNRPC enable back offline transports in trunking discovery
  SUNRPC create an iterator to list only OFFLINE xprts
  NFSv4.1 offline trunkable transports on DESTROY_SESSION
  SUNRPC add function to offline remove trunkable transports
  SUNRPC expose functions for offline remote xprt functionality
  ...

Merge tag 'hwmon-fixes-for-v6.0-rc1' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
"Fix two regressions in nct6775 and lm90 drivers"

* tag 'hwmon-fixes-for-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (nct6775) Fix platform driver suspend regression
hwmon: (lm90) Fix error return value from detect function

Merge tag 'rpmsg-v5.20-1' of git://git./linux/kernel/git/remoteproc/linux

Pull rpmsg fixes from Bjorn Andersson:
"This fixes schema validation warnings in the Devicetree bindings for
  SMD and SMD RPM"

* tag 'rpmsg-v5.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
  dt-bindings: soc: qcom: smd-rpm: extend example
  dt-bindings: soc: qcom: smd: reference SMD edge schema

Merge tag 'mm-stable-2022-08-09' of git://git./linux/kernel/git/akpm/mm

Pull remaining MM updates from Andrew Morton:
"Three patch series - two that perform cleanups and one feature:

   - hugetlb_vmemmap cleanups from Muchun Song

   - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi

   - highmem documentation fixups from Fabio De Francesco"

* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
  Documentation/mm: add details about kmap_local_page() and preemption
  highmem: delete a sentence from kmap_local_page() kdocs
  Documentation/mm: rrefer kmap_local_page() and avoid kmap()
  Documentation/mm: avoid invalid use of addresses from kmap_local_page()
  Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
  highmem: specify that kmap_local_page() is callable from interrupts
  highmem: remove unneeded spaces in kmap_local_page() kdocs
  mm, hwpoison: enable memory error handling on 1GB hugepage
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
  mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  mm: hugetlb_vmemmap: replace early_param() with core_param()
  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  ...

Merge tag 'cxl-for-6.0' of git://git./linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for 6.0:

   - Introduce a 'struct cxl_region' object with support for
     provisioning and assembling persistent memory regions.

   - Introduce alloc_free_mem_region() to accompany the existing
     request_free_mem_region() as a method to allocate physical memory
     capacity out of an existing resource.

   - Export insert_resource_expand_to_fit() for the CXL subsystem to
     late-publish CXL platform windows in iomem_resource.

   - Add a polled mode PCI DOE (Data Object Exchange) driver service and
     use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
     Table)"

* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
  cxl/hdm: Fix skip allocations vs multiple pmem allocations
  cxl/region: Disallow region granularity != window granularity
  cxl/region: Fix x1 interleave to greater than x1 interleave routing
  cxl/region: Move HPA setup to cxl_region_attach()
  cxl/region: Fix decoder interleave programming
  Documentation: cxl: remove dangling kernel-doc reference
  cxl/region: describe targets and nr_targets members of cxl_region_params
  cxl/regions: add padding for cxl_rr_ep_add nested lists
  cxl/region: Fix IS_ERR() vs NULL check
  cxl/region: Fix region reference target accounting
  cxl/region: Fix region commit uninitialized variable warning
  cxl/region: Fix port setup uninitialized variable warnings
  cxl/region: Stop initializing interleave granularity
  cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
  cxl/acpi: Minimize granularity for x1 interleaves
  cxl/region: Delete 'region' attribute from root decoders
  cxl/acpi: Autoload driver for 'cxl_acpi' test devices
  cxl/region: decrement ->nr_targets on error in cxl_region_attach()
  cxl/region: prevent underflow in ways_to_cxl()
  cxl/region: uninitialized variable in alloc_hpa()
  ...

Merge tag 'pinctrl-v6.0-1' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
"Outside the pinctrl driver and DT bindings we hit some Arm DT files,
  patched by the maintainers.

  Other than that it is business as usual.

  Core changes:

   - Add PINCTRL_PINGROUP() helper macro (and use it in the AMD driver).

  New drivers:

   - Intel Meteor Lake support.

   - Reneasas RZ/V2M and r8a779g0 (R-Car V4H).

   - AXP209 variants AXP221, AXP223 and AXP809.

   - Qualcomm MSM8909, PM8226, PMP8074 and SM6375.

   - Allwinner D1.

  Improvements:

   - Proper pin multiplexing in the AMD driver.

   - Mediatek MT8192 can use generic drive strength and pin bias, then
     fixes on top plus some I2C pin group fixes.

   - Have the Allwinner Sunplus SP7021 use the generic DT schema and
     make interrupts optional.

   - Handle Qualcomm SC7280 ADSP.

   - Handle Qualcomm MSM8916 CAMSS GP clock muxing.

   - High impedance bias on ZynqMP.

   - Serialize StarFive access to MMIO.

   - Immutable gpiochip for BCM2835, Ingenic, Qualcomm SPMI GPIO"

* tag 'pinctrl-v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (117 commits)
  dt-bindings: pinctrl: qcom,pmic-gpio: add PM8226 constraints
  pinctrl: qcom: Make PINCTRL_SM8450 depend on PINCTRL_MSM
  pinctrl: qcom: sm8250: Fix PDC map
  pinctrl: amd: Fix an unused variable
  dt-bindings: pinctrl: mt8186: Add and use drive-strength-microamp
  dt-bindings: pinctrl: mt8186: Add gpio-line-names property
  ARM: dts: imxrt1170-pinfunc: Add pinctrl binding header
  pinctrl: amd: Use unicode for debugfs output
  pinctrl: amd: Fix newline declaration in debugfs output
  pinctrl: at91: Fix typo 'the the' in comment
  dt-bindings: pinctrl: st,stm32: Correct 'resets' property name
  pinctrl: mvebu: Missing a blank line after declarations.
  pinctrl: qcom: Add SM6375 TLMM driver
  dt-bindings: pinctrl: Add DT schema for SM6375 TLMM
  dt-bindings: pinctrl: mt8195: Use drive-strength-microamp in examples
  Revert "pinctrl: qcom: spmi-gpio: make the irqchip immutable"
  pinctrl: imx93: Add MODULE_DEVICE_TABLE()
  pinctrl: sunxi: Add driver for Allwinner D1
  pinctrl: sunxi: Make some layout parameters dynamic
  pinctrl: sunxi: Refactor register/offset calculation
  ...

Merge tag 'apparmor-pr-2022-08-08' of git://git./linux/kernel/git/jj/linux-apparmor

Pull AppArmor updates from John Johansen:
"This is mostly cleanups and bug fixes with the one bigger change being
  Mathew Wilcox's patch to use XArrays instead of the IDR from the
  thread around the locking weirdness.

  Features:
   - Convert secid mapping to XArrays instead of IDR
   - Add a kernel label to use on kernel objects
   - Extend policydb permission set by making use of the xbits
   - Make export of raw binary profile to userspace optional
   - Enable tuning of policy paranoid load for embedded systems
   - Don't create raw_sha1 symlink if sha1 hashing is disabled
   - Allow labels to carry debug flags

  Cleanups:
   - Update MAINTAINERS file
   - Use struct_size() helper in kmalloc()
   - Move ptrace mediation to more logical task.{h,c}
   - Resolve uninitialized symbol warnings
   - Remove redundant ret variable
   - Mark alloc_unconfined() as static
   - Update help description of policy hash for introspection
   - Remove some casts which are no-longer required

  Bug Fixes:
   - Fix aa_label_asxprint return check
   - Fix reference count leak in aa_pivotroot()
   - Fix memleak in aa_simple_write_to_buffer()
   - Fix kernel doc comments
   - Fix absroot causing audited secids to begin with =
   - Fix quiet_denied for file rules
   - Fix failed mount permission check error message
   - Disable showing the mode as part of a secid to secctx
   - Fix setting unconfined mode on a loaded profile
   - Fix overlapping attachment computation
   - Fix undefined reference to `zlib_deflate_workspacesize'"

* tag 'apparmor-pr-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor: (34 commits)
  apparmor: Update MAINTAINERS file with new email address
  apparmor: correct config reference to intended one
  apparmor: move ptrace mediation to more logical task.{h,c}
  apparmor: extend policydb permission set by making use of the xbits
  apparmor: allow label to carry debug flags
  apparmor: fix overlapping attachment computation
  apparmor: fix setting unconfined mode on a loaded profile
  apparmor: Fix some kernel-doc comments
  apparmor: Mark alloc_unconfined() as static
  apparmor: disable showing the mode as part of a secid to secctx
  apparmor: Convert secid mapping to XArrays instead of IDR
  apparmor: add a kernel label to use on kernel objects
  apparmor: test: Remove some casts which are no-longer required
  apparmor: Fix memleak in aa_simple_write_to_buffer()
  apparmor: fix reference count leak in aa_pivotroot()
  apparmor: Fix some kernel-doc comments
  apparmor: Fix undefined reference to `zlib_deflate_workspacesize'
  apparmor: fix aa_label_asxprint return check
  apparmor: Fix some kernel-doc comments
  apparmor: Fix some kernel-doc comments
  ...

Merge tag 'kbuild-v5.20' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

- Remove the support for -O3 (CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3)

- Fix error of rpm-pkg cross-builds

- Support riscv for checkstack tool

- Re-enable -Wformwat warnings for Clang

- Clean up modpost, Makefiles, and misc scripts

* tag 'kbuild-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (30 commits)
  modpost: remove .symbol_white_list field entirely
  modpost: remove unneeded .symbol_white_list initializers
  modpost: add PATTERNS() helper macro
  modpost: shorten warning messages in report_sec_mismatch()
  Revert "Kbuild, lto, workaround: Don't warn for initcall_reference in modpost"
  modpost: use more reliable way to get fromsec in section_rel(a)()
  modpost: add array range check to sec_name()
  modpost: refactor get_secindex()
  kbuild: set EXIT trap before creating temporary directory
  modpost: remove unused Elf_Sword macro
  Makefile.extrawarn: re-enable -Wformat for clang
  kbuild: add dtbs_prepare target
  kconfig: Qt5: tell the user which packages are required
  modpost: use sym_get_data() to get module device_table data
  modpost: drop executable ELF support
  checkstack: add riscv support for scripts/checkstack.pl
  kconfig: shorten the temporary directory name for cc-option
  scripts: headers_install.sh: Update config leak ignore entries
  kbuild: error out if $(INSTALL_MOD_PATH) contains % or :
  kbuild: error out if $(KBUILD_EXTMOD) contains % or :
  ...

hwmon: (nct6775) Fix platform driver suspend regression

Commit c3963bc0a0cf ("hwmon: (nct6775) Split core and platform
driver") introduced a slight change in nct6775_suspend() in order to
avoid an otherwise-needless symbol export for nct6775_update_device(),
replacing a call to that function with a simple dev_get_drvdata()
instead.

As it turns out, there is no guarantee that nct6775_update_device()
is ever called prior to suspend. If this happens, the resume function
ends up writing bad data into the various chip registers, which results
in a crash shortly after resume.

To fix the problem, just add the symbol export and return to using
nct6775_update_device() as was employed previously.

Reported-by: Zoltán Kővágó <dirty.ice.hu@gmail.com>
Tested-by: Zoltán Kővágó <dirty.ice.hu@gmail.com>
Fixes: c3963bc0a0cf ("hwmon: (nct6775) Split core and platform driver")
Cc: stable@kernel.org
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Link: https://lore.kernel.org/r/20220810052646.13825-1-zev@bewilderbeest.net
[groeck: Updated description]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

hwmon: (lm90) Fix error return value from detect function

lm90_detect_nuvoton() is supposed to return NULL if it can not detect
a chip, or a pointer to the chip name if it does. Under some circumstances
it returns an error pointer instead. Some versions of gcc interpret an
ERR_PTR as region of size 0 and generate an error message.

  In function ‘__fortify_strlen’,
      inlined from ‘strlcpy’ at ./include/linux/fortify-string.h:159:10,
      inlined from ‘lm90_detect’ at drivers/hwmon/lm90.c:2550:2:
  ./include/linux/fortify-string.h:50:33: error:
      ‘__builtin_strlen’ reading 1 or more bytes from a region of size 0
     50 | #define __underlying_strlen     __builtin_strlen
        |                                 ^
  ./include/linux/fortify-string.h:141:24: note:
      in expansion of macro ‘__underlying_strlen’
    141 |                 return __underlying_strlen(p);
        |                        ^~~~~~~~~~~~~~~~~~~

Returning NULL instead of ERR_PTR() fixes the problem.

Fixes: c7cebce984a2 ("hwmon: (lm90) Rework detect function")
Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

add barriers to buffer_uptodate and set_buffer_uptodate

Let's have a look at this piece of code in __bread_slow:

get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;

Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.

Fix this bug by adding a memory barrier to set_buffer_uptodate and an
acquire barrier to buffer_uptodate (in a similar way as
folio_test_uptodate and folio_mark_uptodate).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'nfsd-6.0' of git://git./linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
"Work on 'courteous server', which was introduced in 5.19, continues
  apace. This release introduces a more flexible limit on the number of
  NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in
  courtesy state long after the lease expiration timeout. The client
  limit is adjusted based on the physical memory size of the server.

  The NFSD filecache is a cache of files held open by NFSv4 clients or
  recently touched by NFSv2 or NFSv3 clients. This cache had some
  significant scalability constraints that have been relieved in this
  release. Thanks to all who contributed to this work.

  A data corruption bug found during the most recent NFS bake-a-thon
  that involves NFSv3 and NFSv4 clients writing the same file has been
  addressed in this release.

  This release includes several improvements in CPU scalability for
  NFSv4 operations. In addition, Neil Brown provided patches that
  simplify locking during file lookup, creation, rename, and removal
  that enables subsequent work on making these operations more scalable.
  We expect to see that work materialize in the next release.

  There are also numerous single-patch fixes, clean-ups, and the usual
  improvements in observability"

* tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits)
  lockd: detect and reject lock arguments that overflow
  NFSD: discard fh_locked flag and fh_lock/fh_unlock
  NFSD: use (un)lock_inode instead of fh_(un)lock for file operations
  NFSD: use explicit lock/unlock for directory ops
  NFSD: reduce locking in nfsd_lookup()
  NFSD: only call fh_unlock() once in nfsd_link()
  NFSD: always drop directory lock in nfsd_unlink()
  NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning.
  NFSD: add posix ACLs to struct nfsd_attrs
  NFSD: add security label to struct nfsd_attrs
  NFSD: set attributes when creating symlinks
  NFSD: introduce struct nfsd_attrs
  NFSD: verify the opened dentry after setting a delegation
  NFSD: drop fh argument from alloc_init_deleg
  NFSD: Move copy offload callback arguments into a separate structure
  NFSD: Add nfsd4_send_cb_offload()
  NFSD: Remove kmalloc from nfsd4_do_async_copy()
  NFSD: Refactor nfsd4_do_copy()
  NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
  NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
  ...

dt-bindings: soc: qcom: smd-rpm: extend example

Replace existing limited example with proper code for Qualcomm Resource
Power Manager (RPM) over SMD based on MSM8916. This also fixes the
example's indentation.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20220723082358.39544-2-krzysztof.kozlowski@linaro.org

dt-bindings: soc: qcom: smd: reference SMD edge schema

The child node of smd is an SMD edge representing remote subsystem.
Bring back missing reference from previously sent patch (disappeared
when applying).

Link: https://lore.kernel.org/r/20220517070113.18023-9-krzysztof.kozlowski@linaro.org
Fixes: 385fad1303af ("dt-bindings: remoteproc: qcom,smd-edge: define re-usable schema for smd-edge")
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20220723082358.39544-1-krzysztof.kozlowski@linaro.org

NFS: Improve readpage/writepage tracing

Switch formatting to better match that used by other NFS tracepoints.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Improve O_DIRECT tracing

Switch the formatting to match the other NFS tracepoints.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Improve write error tracing

Don't leak request pointers, but use the "device:inode" labelling that
is used by all the other trace points. Furthermore, replace use of page
indexes with an offset, again in order to align behaviour with other
NFS trace points.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

Merge tag 'fscache-fixes-20220809' of git://git./linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:

- Fix a cookie access ref leak if a cookie is invalidated a second time
   before the first invalidation is actually processed.

- Add a tracepoint to log cookie lookup failure

* tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: add tracepoint when failing cookie
  fscache: don't leak cookie access refs if invalidation is in progress or failed

Merge tag 'afs-fixes-20220802' of git://git./linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
"Fix AFS refcount handling.

  The first patch converts afs to use refcount_t for its refcounts and
  the second patch fixes afs_put_call() and afs_put_server() to save the
  values they're going to log in the tracepoint before decrementing the
  refcount"

* tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix access after dec in put functions
  afs: Use refcount_t rather than atomic_t

Merge tag 'fs.setgid.v6.0' of git://git./linux/kernel/git/brauner/linux

Pull setgid updates from Christian Brauner:
"This contains the work to move setgid stripping out of individual
  filesystems and into the VFS itself.

  Creating files that have both the S_IXGRP and S_ISGID bit raised in
  directories that themselves have the S_ISGID bit set requires
  additional privileges to avoid security issues.

  When a filesystem creates a new inode it needs to take care that the
  caller is either in the group of the newly created inode or they have
  CAP_FSETID in their current user namespace and are privileged over the
  parent directory of the new inode. If any of these two conditions is
  true then the S_ISGID bit can be raised for an S_IXGRP file and if not
  it needs to be stripped.

  However, there are several key issues with the current implementation:

   - S_ISGID stripping logic is entangled with umask stripping.

     For example, if the umask removes the S_IXGRP bit from the file
     about to be created then the S_ISGID bit will be kept.

     The inode_init_owner() helper is responsible for S_ISGID stripping
     and is called before posix_acl_create(). So we can end up with two
     different orderings:

     1. FS without POSIX ACL support

        First strip umask then strip S_ISGID in inode_init_owner().

        In other words, if a filesystem doesn't support or enable POSIX
        ACLs then umask stripping is done directly in the vfs before
        calling into the filesystem:

     2. FS with POSIX ACL support

        First strip S_ISGID in inode_init_owner() then strip umask in
        posix_acl_create().

        In other words, if the filesystem does support POSIX ACLs then
        unmask stripping may be done in the filesystem itself when
        calling posix_acl_create().

     Note that technically filesystems are free to impose their own
     ordering between posix_acl_create() and inode_init_owner() meaning
     that there's additional ordering issues that influence S_ISGID
     inheritance.

     (Note that the commit message of commit 1639a49ccdce ("fs: move
     S_ISGID stripping into the vfs_*() helpers") gets the ordering
     between inode_init_owner() and posix_acl_create() the wrong way
     around. I realized this too late.)

   - Filesystems that don't rely on inode_init_owner() don't get S_ISGID
     stripping logic.

     While that may be intentional (e.g. network filesystems might just
     defer setgid stripping to a server) it is often just a security
     issue.

     Note that mandating the use of inode_init_owner() was proposed as
     an alternative solution but that wouldn't fix the ordering issues
     and there are examples such as afs where the use of
     inode_init_owner() isn't possible.

     In any case, we should also try the cleaner and generalized
     solution first before resorting to this approach.

   - We still have S_ISGID inheritance bugs years after the initial
     round of S_ISGID inheritance fixes:

       e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes")
       01ea173e103e ("xfs: fix up non-directory creation in SGID directories")
       fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories")

  All of this led us to conclude that the current state is too messy.
  While we won't be able to make it completely clean as
  posix_acl_create() is still a filesystem specific call we can improve
  the S_SIGD stripping situation quite a bit by hoisting it out of
  inode_init_owner() and into the respective vfs creation operations.

  The obvious advantage is that we don't need to rely on individual
  filesystems getting S_ISGID stripping right and instead can
  standardize the ordering between S_ISGID and umask stripping directly
  in the VFS.

  A few short implementation notes:

   - The stripping logic needs to happen in vfs_*() helpers for the sake
     of stacking filesystems such as overlayfs that rely on these
     helpers taking care of S_ISGID stripping.

   - Security hooks have never seen the mode as it is ultimately seen by
     the filesystem because of the ordering issue we mentioned. Nothing
     is changed for them. We simply continue to strip the umask before
     passing the mode down to the security hooks.

   - The following filesystems use inode_init_owner() and thus relied on
     S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
     hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
     overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
     bpf, tmpfs.

     We've audited all callchains as best as we could. More details can
     be found in the commit message to 1639a49ccdce ("fs: move S_ISGID
     stripping into the vfs_*() helpers")"

* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ceph: rely on vfs for setgid stripping
  fs: move S_ISGID stripping into the vfs_*() helpers
  fs: Add missing umask strip in vfs_tmpfile
  fs: add mode_strip_sgid() helper

Merge tag 'memblock-v5.20-rc1' of git://git./linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

- An optimization in memblock_add_range() to reduce array traversals

- Improvements to the memblock test suite

* tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock test: Modify the obsolete description in README
  memblock tests: fix compilation errors
  memblock tests: change build options to run-time options
  memblock tests: remove completed TODO items
  memblock tests: set memblock_debug to enable memblock_dbg() messages
  memblock tests: add verbose output to memblock tests
  memblock tests: Makefile: add arguments to control verbosity
  memblock: avoid some repeat when add new range

Merge tag 'm68knommu-for-v5.20' of git://git./linux/kernel/git/gerg/m68knommu

Pull m68knommu fixes from Greg Ungerer:

- spelling in comment

- compilation when flexcan driver enabled

- sparse warning

* tag 'm68knommu-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: Fix syntax errors in comments
  m68k: coldfire: make symbol m523x_clk_lookup static
  m68k: coldfire/device.c: protect FLEXCAN blocks

Merge tag 'x86_bugs_pbrsb' of git://git./linux/kernel/git/tip/tip

Pull x86 eIBRS fixes from Borislav Petkov:
"More from the CPU vulnerability nightmares front:

  Intel eIBRS machines do not sufficiently mitigate against RET
  mispredictions when doing a VM Exit therefore an additional RSB,
  one-entry stuffing is needed"

* tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Add LFENCE to RSB fill sequence
  x86/speculation: Add RSB VM Exit protections

fscache: add tracepoint when failing cookie

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>

fscache: don't leak cookie access refs if invalidation is in progress or failed

It's possible for a request to invalidate a fscache_cookie will come in
while we're already processing an invalidation. If that happens we
currently take an extra access reference that will leak. Only call
__fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit
was previously clear.

Also, ensure that we attempt to clear the bit when the cookie is
"FAILED" and put the reference to avoid an access leak.

Fixes: 85e4ea1049c7 ("fscache: Fix invalidation/lookup race")
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>

Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd updates from Steve French:

- fixes for memory access bugs (out of bounds access, oops, leak)

- multichannel fixes

- session disconnect performance improvement, and session register
   improvement

- cleanup

* tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix heap-based overflow in set_ntacl_dacl()
  ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
  ksmbd: prevent out of bound read for SMB2_WRITE
  ksmbd: fix use-after-free bug in smb2_tree_disconect
  ksmbd: fix memory leak in smb2_handle_negotiate
  ksmbd: fix racy issue while destroying session on multichannel
  ksmbd: use wait_event instead of schedule_timeout()
  ksmbd: fix kernel oops from idr_remove()
  ksmbd: add channel rwlock
  ksmbd: replace sessions list in connection with xarray
  MAINTAINERS: ksmbd: add entry for documentation
  ksmbd: remove unused ksmbd_share_configs_cleanup function

Merge tag 'pull-work.iov_iter-rebased' of git://git./linux/kernel/git/viro/vfs

Pull more iov_iter updates from Al Viro:

- more new_sync_{read,write}() speedups - ITER_UBUF introduction

- ITER_PIPE cleanups

- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics

- making ITER_PIPE take high-order pages without splitting them

- handling copy_page_from_iter() for high-order pages properly

* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  fix copy_page_from_iter() for compound destinations
  hugetlbfs: copy_page_to_iter() can deal with compound pages
  copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
  expand those iov_iter_advance()...
  pipe_get_pages(): switch to append_pipe()
  get rid of non-advancing variants
  ceph: switch the last caller of iov_iter_get_pages_alloc()
  9p: convert to advancing variant of iov_iter_get_pages_alloc()
  af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: saner helper for page array allocation
  fold __pipe_get_pages() into pipe_get_pages()
  ITER_XARRAY: don't open-code DIV_ROUND_UP()
  unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  unify xarray_get_pages() and xarray_get_pages_alloc()
  unify pipe_get_pages() and pipe_get_pages_alloc()
  iov_iter_get_pages(): sanity-check arguments
  iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  ...

fix copy_page_from_iter() for compound destinations

had been broken for ITER_BVEC et.al. since ever (OK, v3.17 when
ITER_BVEC had first appeared)...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

hugetlbfs: copy_page_to_iter() can deal with compound pages

... since April 2021

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

copy_page_to_iter(): don't split high-order page in case of ITER_PIPE

... just shove it into one pipe_buffer.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

expand those iov_iter_advance()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

pipe_get_pages(): switch to append_pipe()

now that we are advancing the iterator, there's no need to
treat the first page separately - just call append_pipe()
in a loop.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

get rid of non-advancing variants

mechanical change; will be further massaged in subsequent commits

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ceph: switch the last caller of iov_iter_get_pages_alloc()

here nothing even looks at the iov_iter after the call, so we couldn't
care less whether it advances or not.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9p: convert to advancing variant of iov_iter_get_pages_alloc()

that one is somewhat clumsier than usual and needs serious testing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()

... and adjust the callers

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()

... and untangle the cleanup on failure to add into pipe.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

block: convert to advancing variants of iov_iter_get_pages{,_alloc}()

... doing revert if we end up not using some pages

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()

Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.

Provide inline wrappers doing that, convert trivial open-coded
uses of those.

BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

iov_iter: saner helper for page array allocation

All call sites of get_pages_array() are essenitally identical now.
Replace with common helper...

Returns number of slots available in resulting array or 0 on OOM;
it's up to the caller to make sure it doesn't ask to zero-entry
array (i.e. neither maxpages nor size are allowed to be zero).

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fold __pipe_get_pages() into pipe_get_pages()

... and don't mangle maxsize there - turn the loop into counting
one instead.  Easier to see that we won't run out of array that
way.  Note that special treatment of the partial buffer in that
thing is an artifact of the non-advancing semantics of
iov_iter_get_pages() - if not for that, it would be append_pipe(),
same as the body of the loop that follows it.  IOW, once we make
iov_iter_get_pages() advancing, the whole thing will turn into
calculate how many pages do we want
allocate an array (if needed)
call append_pipe() that many times.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_XARRAY: don't open-code DIV_ROUND_UP()

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts

same as for pipes and xarrays; after that iov_iter_get_pages() becomes
a wrapper for __iov_iter_get_pages_alloc().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

unify xarray_get_pages() and xarray_get_pages_alloc()

same as for pipes

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

unify pipe_get_pages() and pipe_get_pages_alloc()

The differences between those two are
* pipe_get_pages() gets a non-NULL struct page ** value pointing to
preallocated array + array size.
* pipe_get_pages_alloc() gets an address of struct page ** variable that
contains NULL, allocates the array and (on success) stores its address in
that variable.

Not hard to combine - always pass struct page ***, have
the previous pipe_get_pages_alloc() caller pass ~0U as cap for
array size.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

iov_iter_get_pages(): sanity-check arguments

zero maxpages is bogus, but best treated as "just return 0";
NULL pages, OTOH, should be treated as a hard bug.

get rid of now completely useless checks in xarray_get_pages{,_alloc}().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper

Incidentally, ITER_XARRAY did *not* free the sucker in case when
iter_xarray_populate_pages() returned 0...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: fold data_start() and pipe_space_for_user() together

All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final buffer.

Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: cache the type of last buffer

We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
check if ->iov_offset is non-zero (i.e. that pipe is not empty)
if so, get the corresponding pipe_buffer and check its ->ops
if it's &default_pipe_buf_ops, we have an anon buffer.

Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with signed field (->last_offset), with
the following rules:
empty, no buffers occupied: 0
anon, with bytes up to N-1 filled: N
zero-copy, with bytes up to N-1 filled: -N

That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.

Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.

Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator.  About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe.  There are only two cases where we leave the
sane state:
1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
2) iov_iter copied, then something is put into the copy.  Since
they share the underlying pipe, the original gets behind.  When we
decide that we are done with the copy (original is not usable until then)
we advance the original.  direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data.  At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: clean iov_iter_revert()

Fold pipe_truncate() into it, clean up. We can release buffers
in the same loop where we walk backwards to the iterator beginning
looking for the place where the new position will be.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: clean pipe_advance() up

instead of setting ->iov_offset for new position and calling
pipe_truncate() to adjust ->len of the last buffer and discard
everything after it, adjust ->len at the same time we set ->iov_offset
and use pipe_discard_from() to deal with buffers past that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: lose iter_head argument of __pipe_get_pages()

it's only used to get to the partial buffer we can add to,
and that's always the last one, i.e. pipe->head - 1.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: fold push_pipe() into __pipe_get_pages()

Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.

Note that the only reason it's not a loop doing append_pipe() is
that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives

New helper: append_pipe().  Extends the last buffer if possible,
allocates a new one otherwise.  Returns page and offset in it
on success, NULL on failure.  iov_iter is advanced past the
data we've got.

Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way.  Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.

[Fix for braino caught by Liu Xinpeng <liuxp11@chinatelecom.cn> folded in]
[another braino fix, this time in copy_pipe_to_iter() and pipe_zero();
caught by testcase from Hugh Dickins]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: helpers for adding pipe buffers

There are only two kinds of pipe_buffer in the area used by ITER_PIPE.

1) anonymous - copy_to_iter() et.al. end up creating those and copying
data there.  They have zero ->offset, and their ->ops points to
default_pipe_page_ops.

2) zero-copy ones - those come from copy_page_to_iter(), and page
comes from caller.  ->offset is also caller-supplied - it might be
non-zero.  ->ops points to page_cache_pipe_buf_ops.

Move creation and insertion of those into helpers - push_anon(pipe, size)
and push_page(pipe, page, offset, size) resp., separating them from
the "could we avoid creating a new buffer by merging with the current
head?" logics.

Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ITER_PIPE: helper for getting pipe buffer by index

pipe_buffer instances of a pipe are organized as a ring buffer,
with power-of-2 size.  Indices are kept *not* reduced modulo ring
size, so the buffer refered to by index N is
pipe->bufs[N & (pipe->ring_size - 1)].

Ring size can change over the lifetime of a pipe, but not while
the pipe is locked.  So for any iov_iter primitives it's a constant.
Original conversion of pipes to this layout went overboard trying
to microoptimize that - calculating pipe->ring_size - 1, storing
it in a local variable and using through the function.  In some
cases it might be warranted, but most of the times it only
obfuscates what's going on in there.

Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
that and use it in the obvious cases.  More will follow...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

splice: stop abusing iov_iter_advance() to flush a pipe

Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother
with rather non-obvious use of iov_iter_advance() in there.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

switch new_sync_{read,write}() to ITER_UBUF

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

new iov_iter flavour - ITER_UBUF

Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Documentation/mm: add details about kmap_local_page() and preemption

What happens if a thread is preempted after mapping pages with
kmap_local_page() was questioned recently.[1]

Commit f3ba3c710ac5 ("mm/highmem: Provide kmap_local*") from Thomas
Gleixner explains clearly that on context switch, the maps of an outgoing
task are removed and the map of the incoming task are restored and that
kmap_local_page() can be invoked from both preemptible and atomic
contexts.[2]

Therefore, for the purpose to make it clearer that users can call
kmap_local_page() from contexts that allow preemption, rework a couple of
sentences and add further information in highmem.rst.

[1] https://lore.kernel.org/lkml/5303077.Sb9uPGUboI@opensuse/
[2] https://lore.kernel.org/all/20201118204007.468533059@linutronix.de/

Link: https://lkml.kernel.org/r/20220728154844.10874-8-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

highmem: delete a sentence from kmap_local_page() kdocs

kmap_local_page() should always be preferred in place of kmap() and
kmap_atomic(). "Only use when really necessary." is not consistent with
the Documentation/mm/highmem.rst and these kdocs it embeds.

Therefore, delete the above-mentioned sentence from kdocs.

Link: https://lkml.kernel.org/r/20220728154844.10874-7-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Documentation/mm: rrefer kmap_local_page() and avoid kmap()

The reasoning for converting kmap() to kmap_local_page() was questioned
recently.[1]

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) kmap() also requires global TLB invalidation when
its pool wraps and it might block when the mapping space is fully utilized
until a slot becomes available.

Warn users to avoid the use of kmap() and instead use kmap_local_page(),
by designing their code to map pages in the same context the mapping will
be used.

[1] https://lore.kernel.org/lkml/1891319.taCxCBeP46@opensuse/

Link: https://lkml.kernel.org/r/20220728154844.10874-6-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Documentation/mm: avoid invalid use of addresses from kmap_local_page()

Users of kmap_local_page() must be absolutely sure to not hand kernel
virtual address obtained calling kmap_local_page() on highmem pages to
other contexts because those pointers are thread local, therefore, they
are no longer valid across different contexts.

Extend the documentation of kmap_local_page() to warn users about the
above-mentioned potential invalid use of pointers returned by
kmap_local_page().

Link: https://lkml.kernel.org/r/20220728154844.10874-5-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Documentation/mm: don't kmap*() pages which can't come from HIGHMEM

There is no need to kmap*() pages which are guaranteed to come from
ZONE_NORMAL (or lower). Linux has currently several call sites of
kmap{,_atomic,_local_page}() on pages which are clearly known which can't
come from ZONE_HIGHMEM.

Therefore, add a paragraph to highmem.rst, to explain better that a plain
page_address() may be used for getting the address of pages which cannot
come from ZONE_HIGHMEM, although it is always safe to use
kmap_local_page() / kunmap_local() also on those pages.

Link: https://lkml.kernel.org/r/20220728154844.10874-4-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

highmem: specify that kmap_local_page() is callable from interrupts

In a recent thread about converting kmap() to kmap_local_page(), the
safety of calling kmap_local_page() was questioned.[1]

"any context" should probably be enough detail for users who want to know
whether or not kmap_local_page() can be called from interrupts. However,
Linux still has kmap_atomic() which might make users think they must use
the latter in interrupts.

Add "including interrupts" for better clarity.

[1] https://lore.kernel.org/lkml/3187836.aeNJFYEL58@opensuse/

Link: https://lkml.kernel.org/r/20220728154844.10874-3-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

highmem: remove unneeded spaces in kmap_local_page() kdocs

Patch series "highmem: Extend kmap_local_page() documentation", v2.

The Highmem interface is evolving and the current documentation does not
reflect the intended uses of each of the calls. Furthermore, after a
recent series of reworks, the differences of the calls can still be
confusing and may lead to the expanded use of calls which are deprecated.

This series is the second round of changes towards an enhanced
documentation of the Highmem's interface; at this stage the patches are
only focused to kmap_local_page().

In addition it also contains some minor clean ups.

This patch (of 7):

In the kdocs of kmap_local_page(), the description of @page starts after
several unnecessary spaces.

Therefore, remove those spaces.

Link: https://lkml.kernel.org/r/20220728154844.10874-1-fmdefrancesco@gmail.com
Link: https://lkml.kernel.org/r/20220728154844.10874-2-fmdefrancesco@gmail.com
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison: enable memory error handling on 1GB hugepage

Now error handling code is prepared, so remove the blocking code and
enable memory error handling on 1GB hugepage.

Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage

Currently if memory_failure() (modified to remove blocking code with
subsequent patch) is called on a page in some 1GB hugepage, memory error
handling fails and the raw error page gets into leaked state.  The impact
is small in production systems (just leaked single 4kB page), but this
limits the testability because unpoison doesn't work for it.  We can no
longer create 1GB hugepage on the 1GB physical address range with such
leaked pages, that's not useful when testing on small systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
broken down into raw error pages before coming to this point:

        if (unlikely(PageHWPoison(page)) && !order) {
                ...
                return false;
        }

Then, the page is not sent to buddy and the page refcount is left 0.

Originally this check is supposed to work when the error page is freed
from page_handle_poison() (that is called from soft-offline), but now we
are opening another path to call it, so the callers of
__page_handle_poison() need to handle the case by considering the return
value 0 as success.  Then page refcount for hwpoison is properly
incremented so unpoison works.

Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison: make __page_handle_poison returns int

__page_handle_poison() returns bool that shows whether
take_page_off_buddy() has passed or not now. But we will want to
distinguish another case of "dissolve has passed but taking off failed" by
its return value. So change the type of the return value. No functional
change.

Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison: set PG_hwpoison for busy hugetlb pages

If memory_failure() fails to grab page refcount on a hugetlb page because
it's busy, it returns without setting PG_hwpoison on it. This not only
loses a chance of error containment, but breaks the rule that
action_result() should be called only when memory_failure() do any of
handling work (even if that's just setting PG_hwpoison). This
inconsistency could harm code maintainability.

So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.

Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev
Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage

Raw error info list needs to be removed when hwpoisoned hugetlb is
unpoisoned. And unpoison handler needs to know how many errors there are
in the target hugepage. So add them.

HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage)) sometimes
can't be unpoisoned, so skip them.

Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hwpoison, hugetlb: support saving mechanism of raw error pages

When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages.  If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolve sometimes fails, then the error page is left as
hwpoisoned hugepage.  It's useful if we can retry to dissolve it to save
healthy pages, but that's not possible now because the information about
where the raw error pages is lost.

Use the private field of a few tail pages to keep that information.  The
code path of shrinking hugepage pool uses this info to try delayed
dissolve.  In order to remember multiple errors in a hugepage, a
singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
constructed.  Only simple operations (adding an entry or clearing all) are
required and the list is assumed not to be very long, so this simple data
structure should be enough.

If we failed to save raw error info, the hwpoison hugepage has errors on
unknown subpage, then this new saving mechanism does not work any more, so
disable saving new raw error info and freeing hwpoison hugepages.

Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry

follow_pud_mask() does not support non-present pud entry now. As long as
I tested on x86_64 server, follow_pud_mask() still simply returns
no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
user-visible effect should happen. But generally we should call
follow_huge_pud() for non-present pud entry for 1GB hugetlb page.

Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous works for pud entries commit
e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
commit cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present
hugepage").

Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()

Patch series "mm, hwpoison: enable 1GB hugepage support", v7.

This patch (of 8):

I found a weird state of 1GB hugepage pool, caused by the following
procedure:

  - run a process reserving all free 1GB hugepages,
  - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
    /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
  - kill the reserving process.

, then all the hugepages are free *and* surplus at the same time.

  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
  0
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
  3

This state is resolved by reserving and allocating the pages then freeing
them again, so this seems not to result in serious problem.  But it's a
little surprising (shrinking pool suddenly fails).

This behavior is caused by hstate_is_gigantic() check in
return_unused_surplus_pages().  This was introduced so long ago in 2008 by
commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and at
that time the gigantic pages were not supposed to be allocated/freed at
run-time.  Now kernel can support runtime allocation/free, so let's check
gigantic_page_runtime_supported() together.

Link: https://lkml.kernel.org/r/20220714042420.1847125-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20220714042420.1847125-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE

There is already a macro PTRS_PER_PTE to represent the number of page
table entries, just use it.

Link: https://lkml.kernel.org/r/20220628092235.91270-9-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst

All the comments which explains how HVO works are moved to
vmemmap_dedup.rst since

commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")

except some comments above page_fixed_fake_head(). This commit moves
those comments to vmemmap_dedup.rst and improve vmemmap_dedup.rst as well.

Link: https://lkml.kernel.org/r/20220628092235.91270-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability

There is a discussion about the name of hugetlb_vmemmap_alloc/free in
thread [1].  The suggestion suggested by David is rename "alloc/free" to
"optimize/restore" to make functionalities clearer to users, "optimize"
means the function will optimize vmemmap pages, while "restore" means
restoring its vmemmap pages discared before.  This commit does this.

Another discussion is the confusion RESERVE_VMEMMAP_NR isn't used
explicitly for vmemmap_addr but implicitly for vmemmap_end in
hugetlb_vmemmap_alloc/free.  David suggested we can compute what
hugetlb_vmemmap_init() does now at runtime.  We do not need to worry for
the overhead of computing at runtime since the calculation is simple
enough and those functions are not in a hot path.  This commit has the
following improvements:

  1) The function suffixed name ("optimize/restore") is more expressive.
  2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
  3) The hugetlb_vmemmap_init() does not need to be exported anymore.
  4) A ->optimize_vmemmap_pages field in struct hstate is killed.
  5) There is only one place where checks is_power_of_2(sizeof(struct
     page)) instead of two places.
  6) Add more comments for hugetlb_vmemmap_optimize/restore().
  7) For external users, hugetlb_optimize_vmemmap_pages() is used for
     detecting if the HugeTLB's vmemmap pages is optimizable originally.
     In this commit, it is killed and we introduce a new helper
     hugetlb_vmemmap_optimizable() to replace it.  The name is more
     expressive.

Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/
Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: replace early_param() with core_param()

After the following commit:

78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")

There is no order requirement between the parameter of
"hugetlb_free_vmemmap" and "hugepages" since we have removed the check of
whether HVO is enabled from hugetlb_vmemmap_init(). Therefore we can
safely replace early_param() with core_param() to simplify the code.

Link: https://lkml.kernel.org/r/20220628092235.91270-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c

When I first introduced vmemmap manipulation functions related to HugeTLB,
I thought those functions may be reused by other modules (e.g.  using
similar approach to optimize vmemmap pages, unfortunately, the DAX used
the same approach but does not use those functions).  After two years, we
didn't see any other users.  So move those functions to hugetlb_vmemmap.c.
Code movement without any functional change.

Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: introduce the name HVO

It it inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others since there
is no specific or abbreviated name for it when it is first introduced.
Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now.

This commit also updates the document about "hugetlb_free_vmemmap" by the
way discussed in thread [1].

Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling

We hold an another reference to hugetlb_optimize_vmemmap_key when making
vmemmap_optimize_mode on, because we use static_key to tell memory_hotplug
that memory_hotplug.memmap_on_memory should be overridden.  However, this
rule has gone when we have introduced PageVmemmapSelfHosted.  Therefore,
we could simplify vmemmap_optimize_mode handling by not holding an another
reference to hugetlb_optimize_vmemmap_key.  This also means that we not
incur the extra page_fixed_fake_head checks if there are no vmemmap
optinmized hugetlb pages after this change.

Link: https://lkml.kernel.org/r/20220628092235.91270-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()

Patch series "Simplify hugetlb vmemmap and improve its readability", v2.

This series aims to simplify hugetlb vmemmap and improve its readability.

This patch (of 8):

The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests
two conditions (enabled and pages in use).  Instead of coming up to an
appropriate name, we could just delete it.  There is already a discussion
about deleting it in thread [1].

There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c.
However, it does not need to call hugetlb_optimize_vmemmap_enabled() in
flush_dcache_page() since HugeTLB pages are always fully mapped and only
head page will be set PG_dcache_clean meaning only head page's flag may
need to be cleared (see commit cf5a501d985b).  So it is easy to remove
hugetlb_optimize_vmemmap_enabled().

Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/
Link: https://lkml.kernel.org/r/20220628092235.91270-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge tag 'rproc-v5.20' of git://git./linux/kernel/git/remoteproc/linux

Pull remoteproc updates from Bjorn Andersson:
"This introduces support for the remoteproc on Mediatek MT8188, and
  enables caches for MT8186 SCP. It adds support for PRU cores found on
  the TI K3 AM62x SoCs.

  It moves the recovery work after a firmware crash to an unbound
  workqueue, to allow recovery to happen in parallel.

  A new DMA API is introduced to release dma_mem for a device.

  It adds support a panic handler for the Qualcomm modem remoteproc,
  with the goal of having caches flushed in memory dumps for post-mortem
  debugging and it introduces a mechanism to wait for the modem firmware
  on SM8450 to decrypt part of its memory for post-mortem debugging.

  Qualcomm sysmon is restricted to only inform remote processors about
  peers that are actually running, to avoid a race where Linux tries to
  notify a recovering remote processor about its peers new state. A
  mechanism for waiting for the sysmon connection to be established is
  also introduced, to avoid out-of-sync updates for rapidly restarting
  remote processors.

  A number of Devicetree binding cleanups and conversions to YAML are
  introduced, to facilitate Devicetree validation. Lastly it introduces
  a number of smaller fixes and cleanups in the core and a few different
  drivers"

* tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: (42 commits)
  remoteproc: qcom_q6v5_pas: Do not fail if regulators are not found
  drivers/remoteproc: fix repeated words in comments
  remoteproc: Directly use ida_alloc()/free()
  remoteproc: Use unbounded workqueue for recovery work
  remoteproc: using pm_runtime_resume_and_get instead of pm_runtime_get_sync
  remoteproc: qcom_q6v5_pas: Deal silently with optional px and cx regulators
  remoteproc: sysmon: Send sysmon state only for running rprocs
  remoteproc: sysmon: Wait for SSCTL service to come up
  remoteproc: qcom: q6v5: Set q6 state to offline on receiving wdog irq
  remoteproc: qcom: pas: Check if coredump is enabled
  remoteproc: qcom: pas: Mark devices as wakeup capable
  remoteproc: qcom: pas: Mark va as io memory
  remoteproc: qcom: pas: Add decrypt shutdown support for modem
  remoteproc: qcom: q6v5-mss: add powerdomains to MSM8996 config
  remoteproc: qcom_q6v5: Introduce panic handler for MSS
  remoteproc: qcom_q6v5_mss: Update MBA log info
  remoteproc: qcom: correct kerneldoc
  remoteproc: qcom_q6v5_mss: map/unmap metadata region before/after use
  remoteproc: qcom: using pm_runtime_resume_and_get to simplify the code
  remoteproc: mediatek: Support MT8188 SCP
  ...

Merge tag 'rpmsg-v5.20' of git://git./linux/kernel/git/remoteproc/linux

Pull rpmsg updates from Bjorn Andersson:
"This contains fixes and cleanups in the rpmsg core, Qualcomm SMD and
  GLINK drivers, a circular lock dependency in the Mediatek driver and
  a possible race condition in the rpmsg_char driver is resolved"

* tag 'rpmsg-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
  rpmsg: convert sysfs snprintf to sysfs_emit
  rpmsg: qcom_smd: Fix refcount leak in qcom_smd_parse_edge
  rpmsg: qcom: correct kerneldoc
  rpmsg: qcom: glink: remove unused name
  rpmsg: qcom: glink: replace strncpy() with strscpy_pad()
  rpmsg: Strcpy is not safe, use strscpy_pad() instead
  rpmsg: Fix possible refcount leak in rpmsg_register_device_override()
  rpmsg: Fix parameter naming for announce_create/destroy ops
  rpmsg: mtk_rpmsg: Fix circular locking dependency
  rpmsg: char: Add mutex protection for rpmsg_eptdev_open()

Merge tag 'linux-watchdog-5.20-rc1' of git://linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

- add RTL9310 support

- sp805_wdt: add arm cmsdk apb wdt support

- Remove #ifdef guards for PM related functions for several watchdog
   device drivers

- pm8916_wdt reboot improvements

- Several other fixes and improvements

* tag 'linux-watchdog-5.20-rc1' of git://www.linux-watchdog.org/linux-watchdog: (24 commits)
  watchdog: armada_37xx_wdt: check the return value of devm_ioremap() in armada_37xx_wdt_probe()
  watchdog: dw_wdt: Fix comment typo
  watchdog: Fix comment typo
  dt-bindings: watchdog: Add fsl,scu-wdt yaml file
  watchdog:Fix typo in comment
  watchdog: pm8916_wdt: Handle watchdog enabled by bootloader
  watchdog: pm8916_wdt: Report reboot reason
  watchdog: pm8916_wdt: Avoid read of write-only PET register
  watchdog: wdat_wdt: Remove #ifdef guards for PM related functions
  watchdog: tegra_wdt: Remove #ifdef guards for PM related functions
  watchdog: st_lpc_wdt: Remove #ifdef guards for PM related functions
  watchdog: sama5d4_wdt: Remove #ifdef guards for PM related functions
  watchdog: s3c2410_wdt: Remove #ifdef guards for PM related functions
  watchdog: mtk_wdt: Remove #ifdef guards for PM related functions
  watchdog: dw_wdt: Remove #ifdef guards for PM related functions
  watchdog: bcm7038_wdt: Remove #ifdef guards for PM related functions
  watchdog: realtek-otto: add RTL9310 support
  dt-bindings: watchdog: realtek,otto-wdt: add RTL9310
  watchdog: sp805_wdt: add arm cmsdk apb wdt support
  watchdog: sp5100_tco: Fix a memory leak of EFCH MMIO resource
  ...

Merge tag 'pm-5.20-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
"These are ARM cpufreq updates and operating performance points (OPP)
  updates plus one cpuidle update adding a new trace point.

  Specifics:

   - Fix return error code in mtk_cpu_dvfs_info_init (Yang Yingliang).

   - Minor cleanups and support for new boards for Qcom cpufreq drivers
     (Bryan O'Donoghue, Konrad Dybcio, Pierre Gondois, and Yicong Yang).

   - Fix sparse warnings for Tegra cpufreq driver (Viresh Kumar).

   - Make dev_pm_opp_set_regulators() accept NULL terminated list
     (Viresh Kumar).

   - Add dev_pm_opp_set_config() and friends and migrate other users and
     helpers to using them (Viresh Kumar).

   - Add support for multiple clocks for a device (Viresh Kumar and
     Krzysztof Kozlowski).

   - Configure resources before adding OPP table for Venus (Stanimir
     Varbanov).

   - Keep reference count up for opp->np and opp_table->np while they
     are still in use (Liang He).

   - Minor OPP cleanups (Viresh Kumar and Yang Li).

   - Add a trace event for cpuidle to track missed (too deep or too
     shallow) wakeups (Kajetan Puchalski)"

* tag 'pm-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (55 commits)
  cpuidle: Add cpu_idle_miss trace event
  venus: pm_helpers: Fix warning in OPP during probe
  OPP: Don't drop opp->np reference while it is still in use
  OPP: Don't drop opp_table->np reference while it is still in use
  cpufreq: tegra194: Staticize struct tegra_cpufreq_soc instances
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add SM6375 compatible
  dt-bindings: opp: Add msm8939 to the compatible list
  dt-bindings: opp: Add missing compat devices
  dt-bindings: opp: opp-v2-kryo-cpu: Fix example binding checks
  cpufreq: Change order of online() CB and policy->cpus modification
  cpufreq: qcom-hw: Remove deprecated irq_set_affinity_hint() call
  cpufreq: qcom-hw: Disable LMH irq when disabling policy
  cpufreq: qcom-hw: Reset cancel_throttle when policy is re-enabled
  cpufreq: qcom-cpufreq-hw: use HZ_PER_KHZ macro in units.h
  cpufreq: mediatek: fix error return code in mtk_cpu_dvfs_info_init()
  OPP: Remove dev{m}_pm_opp_of_add_table_noclk()
  PM / devfreq: tegra30: Register config_clks helper
  OPP: Allow config_clks helper for single clk case
  OPP: Provide a simple implementation to configure multiple clocks
  OPP: Assert clk_count == 1 for single clk helpers
  ...

Merge tag 'thermal-5.20-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull more thermal control updates from Rafael Wysocki:
"These fix an error code path issue leading to a NULL pointer
  dereference, drop Kconfig dependencies that are not needed any more
  after recent changes, add CPU IDs for new chips to a driver and fix up
  the tmon utility.

  Specifics:

   - Fix NULL pointer dereference in the thermal sysfs interface that
     results from an error code path mishandling (Rafael Wysocki).

   - Drop COMPILE_TEST dependency that's not needed any more from two
     thermal Kconfig entries (Jean Delvare).

   - Make the Intel TCC cooling driver support Alder Lake-N and Raptor
     Lake-P (Sumeet Pawnikar).

   - Fix possible path truncations in the tmon utility (Florian
     Fainelli)"

* tag 'thermal-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tools/thermal: Fix possible path truncations
  thermal: Drop obsolete dependency on COMPILE_TEST
  thermal: sysfs: Fix cooling_device_stats_setup() error code path
  thermal: intel: Add TCC cooling support for Alder Lake-N and Raptor Lake-P

Merge tag 'sysctl-6.0-rc1' of git://git./linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
"There isn't much for 6.0 for sysctl stuff, most of the stuff went
  through the networking subsystem (Kuniyuki Iwashima's trove of fixes
  using READ_ONCE/WRITE_ONCE helpers) as most of the issues there have
  been identified on networking side. So it is good we don't have much
  updates as we would have ended up with tons of conflicts. I rebased my
  delta just now to your tree so to avoid conflicts with that stuff.
  This merge request is just minor fluff cleanups then. Perhaps for 6.1
  kernel/sysctl.c will get more love than this release"

* tag 'sysctl-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kernel/sysctl.c: Remove trailing white space
  kernel/sysctl.c: Clean up indentation, replace spaces with tab.
  sysctl: Merge adjacent CONFIG_TREE_RCU blocks

Merge tag 'modules-6.0-rc1' of git://git./linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
"For the 6.0 merge window the modules code shifts to cleanup and minor
  fixes effort. This becomes much easier to do and review now due to the
  code split to its own directory from effort on the last kernel
  release. I expect to see more of this with time and as we expand on
  test coverage in the future. The cleanups and fixes come from usual
  suspects such as Christophe Leroy and Aaron Tomlin but there are also
  some other contributors.

  One particular minor fix worth mentioning is from Helge Deller, where
  he spotted a *forever* incorrect natural alignment on both ELF section
  header tables:

    * .altinstructions
    * __bug_table sections

  A lot of back and forth went on in trying to determine the ill effects
  of this misalignment being present for years and it has been
  determined there should be no real ill effects unless you have a buggy
  exception handler. Helge actually hit one of these buggy exception
  handlers on parisc which is how he ended up spotting this issue. When
  implemented correctly these paths with incorrect misalignment would
  just mean a performance penalty, but given that we are dealing with
  alternatives on modules and with the __bug_table (where info regardign
  BUG()/WARN() file/line information associated with it is stored) this
  really shouldn't be a big deal.

  The only other change with mentioning is the kmap() with
  kmap_local_page() and my only concern with that was on what is done
  after preemption, but the virtual addresses are restored after
  preemption. This is only used on module decompression.

  This all has sit on linux-next for a while except the kmap stuff which
  has been there for 3 weeks"

* tag 'modules-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  module: Replace kmap() with kmap_local_page()
  module: Show the last unloaded module's taint flag(s)
  module: Use strscpy() for last_unloaded_module
  module: Modify module_flags() to accept show_state argument
  module: Move module's Kconfig items in kernel/module/
  MAINTAINERS: Update file list for module maintainers
  module: Use vzalloc() instead of vmalloc()/memset(0)
  modules: Ensure natural alignment for .altinstructions and __bug_table sections
  module: Increase readability of module_kallsyms_lookup_name()
  module: Fix ERRORs reported by checkpatch.pl
  module: Add support for default value for module async_probe

NFS: don't unhash dentry during unlink/rename

NFS unlink() (and rename over existing target) must determine if the
file is open, and must perform a "silly rename" instead of an unlink (or
before rename) if it is.  Otherwise the client might hold a file open
which has been removed on the server.

Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink/rename has been completed on the
server.

This is currently achieved by unhashing the dentry.  This forces any
open attempt to the slow-path for lookup which will block on i_rwsem on
the directory until the unlink/rename completes.  A future patch will
change the VFS to only get a shared lock on i_rwsem for unlink, so this
will no longer work.

Instead we introduce an explicit interlock.  A special value is stored
in dentry->d_fsdata while the unlink/rename is running and
->d_revalidate blocks while that value is present.  When ->d_revalidate
unblocks, the dentry will be invalid.  This closes the race
without requiring exclusion on i_rwsem.

d_fsdata is already used in two different ways.
1/ an IS_ROOT directory dentry might have a "devname" stored in
   d_fsdata.  Such a dentry doesn't have a name and so cannot be the
   target of unlink or rename.  For safety we check if an old devname
   is still stored, and remove it if it is.
2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct
   nfs_unlinkdata' stored in d_fsdata.  While this is set maydelete()
   will fail, so an unlink or rename will never proceed on such
   a dentry.

Neither of these can be in effect when a dentry is the target of unlink
or rename.  So we can expect d_fsdata to be NULL, and store a special
value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate
that any lookup will be blocked.

The d_count() is incremented under d_lock() when a lookup finds the
dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under
the same lock to avoid any races.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

Merge tag 'leds-5.20-rc1' of git://git./linux/kernel/git/pavel/linux-leds

Pull LED updates from Pavel Machek:
"A new driver for bcm63138, is31fl319x updates, fixups for multicolor.

  The clevo-mail driver got disabled, it needs an API fix"

* tag 'leds-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds: (23 commits)
  leds: is31fl319x: use simple i2c probe function
  leds: is31fl319x: Fix devm vs. non-devm ordering
  leds: is31fl319x: Make use of dev_err_probe()
  leds: is31fl319x: Make use of device properties
  leds: is31fl319x: Cleanup formatting and dev_dbg calls
  leds: is31fl319x: Add support for is31fl319{0,1,3} chips
  leds: is31fl319x: Move chipset-specific values in chipdef struct
  leds: is31fl319x: Use non-wildcard names for vars, structs and defines
  leds: is31fl319x: Add missing si-en compatibles
  dt-bindings: leds: pwm-multicolor: document max-brigthness
  leds: turris-omnia: convert to use dev_groups
  leds: leds-bcm63138: get rid of LED_OFF
  leds: add help info about BCM63138 module name
  dt-bindings: leds: leds-bcm63138: unify full stops in descriptions
  dt-bindings: leds: lp50xx: fix LED children names
  dt-bindings: leds: class-multicolor: reference class directly in multi-led node
  leds: bcm63138: add support for BCM63138 controller
  dt-bindings: leds: add Broadcom's BCM63138 controller
  leds: clevo-mail: Mark as broken pending interface fix
  leds: pwm-multicolor: Support active-low LEDs
  ...

Merge tag 'tty-6.0-rc1' of git://git./linux/kernel/git/gregkh/tty

Pull tty / serial driver updates from Greg KH:
"Here is the big set of tty and serial driver changes for 6.0-rc1.

  It was delayed from last week as I wanted to make sure the last commit
  here got some good testing in linux-next and elsewhere as it seemed to
  show up only late in testing for some reason.

  Nothing major here, just lots of cleanups from Jiri and Ilpo to make
  the tty core cleaner (Jiri) and the rs485 code simpler to use (Ilpo).

  Also included in here is the obligatory n_gsm updates from Daniel
  Starke and lots of tiny driver updates and minor fixes and tweaks for
  other smaller serial drivers.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'tty-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (186 commits)
  tty: serial: qcom-geni-serial: Fix %lu -> %u in print statements
  tty: amiserial: Fix comment typo
  tty: serial: document uart_get_console()
  tty: serial: serial_core, reformat kernel-doc for functions
  Documentation: serial: link uart_ops properly
  Documentation: serial: move GPIO kernel-doc to the functions
  Documentation: serial: dedup kernel-doc for uart functions
  Documentation: serial: move uart_ops documentation to the struct
  dt-bindings: serial: snps-dw-apb-uart: Document Rockchip RV1126
  serial: mvebu-uart: uart2 error bits clearing
  tty: serial: fsl_lpuart: correct the count of break characters
  serial: stm32: make info structs static to avoid sparse warnings
  serial: fsl_lpuart: zero out parity bit in CS7 mode
  tty: serial: qcom-geni-serial: Fix get_clk_div_rate() which otherwise could return a sub-optimal clock rate.
  serial: 8250_bcm2835aux: Add missing clk_disable_unprepare()
  tty: vt: initialize unicode screen buffer
  serial: remove VR41XX serial driver
  serial: 8250: lpc18xx: Remove redundant sanity check for RS485 flags
  serial: 8250_dwlib: remove redundant sanity check for RS485 flags
  dt_bindings: rs485: Correct delay values
  ...

Merge tag 'f2fs-for-6.0' of git://git./linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we mainly fixed some corner cases that manipulate a
  per-file compression flag inappropriately. And, we found f2fs counted
  valid blocks in a section incorrectly when zone capacity is set, and
  thus, fixed it with additional sysfs entry to check it easily.

  Lastly, this series includes several patches with respect to the new
  atomic write support such as a couple of bug fixes and re-adding
  atomic_write_abort support that we removed by mistake in the previous
  release.

  Enhancements:
   - add sysfs entries to understand atomic write operations and zone
     capacity
   - introduce memory mode to get a hint for low-memory devices
   - adjust the waiting time of foreground GC
   - decompress clusters under softirq to avoid non-deterministic
     latency
   - do not skip updating inode when retrying to flush node page
   - enforce single zone capacity

  Bug fixes:
   - set the compression/no-compression flags correctly
   - revive F2FS_IOC_ABORT_VOLATILE_WRITE
   - check inline_data during compressed inode conversion
   - understand zone capacity when calculating valid block count

  As usual, the series includes several minor clean-ups and sanity
  checks"

* tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
  f2fs: use onstack pages instead of pvec
  f2fs: intorduce f2fs_all_cluster_page_ready
  f2fs: clean up f2fs_abort_atomic_write()
  f2fs: handle decompress only post processing in softirq
  f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
  f2fs: do not set compression bit if kernel doesn't support
  f2fs: remove device type check for direct IO
  f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
  f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
  f2fs: fix to do sanity check on segment type in build_sit_entries()
  f2fs: obsolete unused MAX_DISCARD_BLOCKS
  f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
  f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
  f2fs: introduce sysfs atomic write statistics
  f2fs: don't bother wait_ms by foreground gc
  f2fs: invalidate meta pages only for post_read required inode
  f2fs: allow compression of files without blocks
  f2fs: fix to check inline_data during compressed inode conversion
  f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
  f2fs: fix to invalidate META_MAPPING before DIO write
  ...

Merge tag 'fuse-update-6.0' of git://git./linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

- Fix an issue with reusing the bdi in case of block based filesystems

- Allow root (in init namespace) to access fuse filesystems in user
   namespaces if expicitly enabled with a module param

- Misc fixes

* tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: retire block-device-based superblock on force unmount
  vfs: function to prevent re-use of block-device-based superblocks
  virtio_fs: Modify format for virtio_fs_direct_access
  virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
  fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
  fuse: Remove the control interface for virtio-fs
  fuse: ioctl: translate ENOSYS
  fuse: limit nsec
  fuse: avoid unnecessary spinlock bump
  fuse: fix deadlock between atomic O_TRUNC and page invalidation
  fuse: write inode in fuse_release()

Merge tag 'ovl-update-6.0' of git://git./linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:
"Just a small update"

* tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix spelling mistakes
  ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()
  ovl: improve ovl_get_acl() if POSIX ACL support is off
  ovl: fix some kernel-doc comments
  ovl: warn if trusted xattr creation fails

Merge tag 'exfat-for-5.20-rc1' of git://git./linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

- fix the error code of rename syscall

- cleanup and suppress the superfluous error messages

- remove duplicate directory entry update

- add exfat git tree in MAINTAINERS

* tag 'exfat-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  MAINTAINERS: Add Namjae's exfat git tree
  exfat: Drop superfluous new line for error messages
  exfat: Downgrade ENAMETOOLONG error message to debug messages
  exfat: Expand exfat_err() and co directly to pr_*() macro
  exfat: Define NLS_NAME_* as bit flags explicitly
  exfat: Return ENAMETOOLONG consistently for oversized paths
  exfat: remove duplicate write inode for extending dir/file
  exfat: remove duplicate write inode for truncating file
  exfat: reuse __exfat_write_inode() to update directory entry

vfs: Check the truncate maximum size in inode_newsize_ok()

If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.

Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:

(1) Cachefiles will round up the EOF of the backing file to DIO block
     size so as to allow DIO on the final block - but this might push
     the offset negative. It then calls notify_change(), but this
     inadvertently bypasses the checking. This can be triggered if
     someone puts an 8EiB-1 file on a server for someone else to try and
     access by, say, nfs.

(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
     and then calls vfs_truncate() directly - which also bypasses the
     check.

In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.

nfsd is okay as it checks the value, but we can then remove this check
too.

Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.

Fixes: 1f08c925e7a3 ("cachefiles: Implement backing file wrangling")
Fixes: f44158485826 ("cifsd: add file operations")
Signed-off-by: David Howells <dhowells@redhat.com>
Reported-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: stable@kernel.org
Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>