review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge tag 'fscrypt-for-linus' of git://git./fs/fscrypt/linux

Pull fscrypt update from Eric Biggers:
"Just a small documentation improvement"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: improve the "Encryption modes and usage" section

Merge tag 'iomap-6.6-merge-3' of git://git./fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"We've got some big changes for this release -- I'm very happy to be
  landing willy's work to enable large folios for the page cache for
  general read and write IOs when the fs can make contiguous space
  allocations, and Ritesh's work to track sub-folio dirty state to
  eliminate the write amplification problems inherent in using large
  folios.

  As a bonus, io_uring can now process write completions in the caller's
  context instead of bouncing through a workqueue, which should reduce
  io latency dramatically. IOWs, XFS should see a nice performance bump
  for both IO paths.

  Summary:

   - Make large writes to the page cache fill sparse parts of the cache
     with large folios, then use large memcpy calls for the large folio.

   - Track the per-block dirty state of each large folio so that a
     buffered write to a single byte on a large folio does not result in
     a (potentially) multi-megabyte writeback IO.

   - Allow some directio completions to be performed in the initiating
     task's context instead of punting through a workqueue. This will
     reduce latency for some io_uring requests"

* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  iomap: support IOCB_DIO_CALLER_COMP
  io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
  fs: add IOCB flags related to passing back dio completions
  iomap: add IOMAP_DIO_INLINE_COMP
  iomap: only set iocb->private for polled bio
  iomap: treat a write through cache the same as FUA
  iomap: use an unsigned type for IOMAP_DIO_* defines
  iomap: cleanup up iomap_dio_bio_end_io()
  iomap: Add per-block dirty state tracking to improve performance
  iomap: Allocate ifs in ->write_begin() early
  iomap: Refactor iomap_write_delalloc_punch() function out
  iomap: Use iomap_punch_t typedef
  iomap: Fix possible overflow condition in iomap_write_delalloc_scan
  iomap: Add some uptodate state handling helpers for ifs state bitmap
  iomap: Drop ifs argument from iomap_set_range_uptodate()
  iomap: Rename iomap_page to iomap_folio_state and others
  iomap: Copy larger chunks from userspace
  iomap: Create large folios in the buffered write path
  filemap: Allow __filemap_get_folio to allocate large folios
  filemap: Add fgf_t typedef
  ...
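
The per-block dirty tracking described in the summary above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the class name `FolioDirtyState` and its methods are hypothetical and do not correspond to the kernel's iomap_folio_state code. It shows why a one-byte write need not trigger writeback of the whole folio.

```python
class FolioDirtyState:
    """Toy model of per-block dirty tracking inside one large folio."""

    def __init__(self, folio_size, block_size):
        if folio_size % block_size:
            raise ValueError("folio size must be a multiple of the block size")
        self.folio_size = folio_size
        self.block_size = block_size
        # One dirty flag per filesystem block covered by the folio.
        self.dirty = [False] * (folio_size // block_size)

    def write(self, offset, length):
        """Mark every block touched by a buffered write as dirty."""
        if length <= 0 or offset < 0 or offset + length > self.folio_size:
            raise ValueError("write must lie inside the folio")
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for blk in range(first, last + 1):
            self.dirty[blk] = True

    def writeback_ranges(self):
        """Return (offset, length) byte ranges covering runs of dirty blocks."""
        ranges = []
        start = None
        # The trailing False terminates a run that reaches the last block.
        for blk, is_dirty in enumerate(self.dirty + [False]):
            if is_dirty and start is None:
                start = blk
            elif not is_dirty and start is not None:
                ranges.append((start * self.block_size,
                               (blk - start) * self.block_size))
                start = None
        return ranges


folio = FolioDirtyState(folio_size=2 * 1024 * 1024, block_size=4096)
folio.write(offset=5000, length=1)   # a single byte in the second block
print(folio.writeback_ranges())
```

In this model the single-byte write dirties one 4096-byte block, so writeback covers that block rather than the full 2 MiB folio.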

Merge tag 'erofs-for-6.6-rc1' of git://git./linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
"In this cycle, a xattr bloom filter feature is introduced to speed up
  negative xattr lookups, which was originally suggested by Alexander
  for Composefs use cases.

  Additionally, the DEFLATE algorithm is now supported, which can be
  used together with hardware accelerators for our cloud workloads. Each
  supported compression algorithm can be selected on a per-file basis
  for specific access patterns too.

  There are also some random fixes and cleanups as usual:

   - Support xattr bloom filter to optimize negative xattr lookups

   - Support DEFLATE compression algorithm as an alternative

   - Fix a regression where ztailpacking pclusters weren't released properly

   - No longer warn about the dedupe and fragments features

   - Some folio conversions and cleanups"

* tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: release ztailpacking pclusters properly
  erofs: don't warn dedupe and fragments features anymore
  erofs: adapt folios for z_erofs_read_folio()
  erofs: adapt folios for z_erofs_readahead()
  erofs: get rid of fe->backmost for cache decompression
  erofs: drop z_erofs_page_mark_eio()
  erofs: tidy up z_erofs_do_read_page()
  erofs: move preparation logic into z_erofs_pcluster_begin()
  erofs: avoid obsolete {collector,collection} terms
  erofs: simplify z_erofs_read_fragment()
  erofs: remove redundant erofs_fs_type declaration in super.c
  erofs: add necessary kmem_cache_create flags for erofs inode cache
  erofs: clean up redundant comment and adjust code alignment
  erofs: refine warning messages for zdata I/Os
  erofs: boost negative xattr lookup with bloom filter
  erofs: update on-disk format for xattr name filter
  erofs: DEFLATE compression support
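
How a bloom filter speeds up negative xattr lookups can be sketched in a few lines. This is an illustration only: the 32-bit filter width, the single CRC32-based hash, and the function names are assumptions made for the sketch, not the erofs on-disk format or its hash function.

```python
import zlib

FILTER_BITS = 32  # assumed filter width for this sketch


def _bit_for(name):
    """Map an xattr name to one bit position (stand-in hash: CRC32)."""
    return 1 << (zlib.crc32(name.encode()) % FILTER_BITS)


def build_filter(names):
    """Set one bit per xattr name that exists on the inode."""
    bits = 0
    for name in names:
        bits |= _bit_for(name)
    return bits


def may_have_xattr(filter_bits, name):
    """False: the name is definitely absent, so the lookup can be skipped.
    True: the name may exist, so a full xattr lookup is still required."""
    return bool(filter_bits & _bit_for(name))


inode_filter = build_filter(["user.comment", "security.selinux"])
print(may_have_xattr(inode_filter, "user.comment"))
```

A bloom filter never reports an existing name as absent. It can only produce false positives, which cost one regular lookup.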

Merge tag 'filelock-v6.6' of git://git./linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:

- new functionality for F_OFD_GETLK: requesting a type of F_UNLCK will
   find info about whatever lock happens to be first in the given range,
   regardless of type.

- an OFD lock selftest

- bugfix involving a UAF in a tracepoint

- comment typo fix

* tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock
  fs/locks: Fix typo
  selftests: add OFD lock tests
  fs/locks: F_UNLCK extension for F_OFD_GETLK
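
The difference between a conventional F_OFD_GETLK query and one that requests F_UNLCK can be modelled as follows. This is a sketch under assumptions: lock ownership is ignored, list order stands in for whatever order the kernel keeps its locks in, and `Lock` and `getlk()` are hypothetical names rather than a kernel or libc API.

```python
from collections import namedtuple

F_RDLCK, F_WRLCK, F_UNLCK = "F_RDLCK", "F_WRLCK", "F_UNLCK"

# Byte range [start, end], both inclusive.
Lock = namedtuple("Lock", ["type", "start", "end"])


def getlk(locks, req_type, start, end):
    """Return the first lock in [start, end] that answers the query, or None.

    F_RDLCK / F_WRLCK: only a lock that would conflict with the request.
    F_UNLCK: the first lock found in the range, regardless of its type.
    """
    for lock in locks:
        if lock.end < start or lock.start > end:
            continue  # no overlap with the queried range
        if req_type == F_UNLCK:
            return lock
        if req_type == F_WRLCK or lock.type == F_WRLCK:
            return lock  # read/read is the only non-conflicting pair
    return None


held = [Lock(F_RDLCK, 0, 9), Lock(F_WRLCK, 20, 29)]
print(getlk(held, F_RDLCK, 0, 29))
print(getlk(held, F_UNLCK, 0, 29))
```

A read-lock query skips the existing read lock and reports the write lock, while the F_UNLCK query reports the read lock because it is the first one in the range.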

Merge tag 'v6.6-fs.proc.uapi' of git://git./linux/kernel/git/vfs/vfs

Pull procfs fixes from Christian Brauner:
"Mode changes to files under /proc/<pid>/ aren't supported ever since
  commit 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files").

  Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread
  cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
  mode changes on /proc/thread-self/comm were accidentally allowed.

  Similarly, mode changes for all files beneath /proc/<pid>/net/ are
  blocked but mode changes on /proc/<pid>/net itself were accidentally
  allowed.

  Both issues come down to not using the generic proc_setattr() helper
  which blocks all mode changes. This is rectified with this pull
  request.

  This also removes a strange nolibc test that abused /proc/<pid>/net
  for testing mode changes. Using procfs for this test never made a lot
  of sense given procfs has special semantics for almost everything
  anyway.

  Both changes are minor user-visible changes. It is however very
  unlikely that mode changes on /proc/<pid>/net and
  /proc/thread-self/comm are something that userspace relies on"

* tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  procfs: block chmod on /proc/thread-self/comm
  proc: use generic setattr() for /proc/$PID/net
  selftests/nolibc: drop test chmod_net
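
The behaviour that the generic helper enforces can be summarised by a tiny model. This is a sketch, not the kernel function: `proc_setattr_model()` is a hypothetical name, the attribute bits are chosen for the illustration, and returning EPERM is an assumption about the error code.

```python
import errno

# Attribute-change bits for this sketch (assumed values).
ATTR_MODE = 1 << 0
ATTR_UID = 1 << 1
ATTR_GID = 1 << 2


def proc_setattr_model(ia_valid):
    """Return 0 if the requested change may proceed, else a negative errno.

    Any request that includes a mode change is refused outright; other
    attribute changes fall through to the usual permission checks, which
    this model does not reproduce.
    """
    if ia_valid & ATTR_MODE:
        return -errno.EPERM
    return 0


print(proc_setattr_model(ATTR_MODE))
print(proc_setattr_model(ATTR_UID | ATTR_GID))
```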

Merge tag 'v6.6-vfs.autofs' of git://git./linux/kernel/git/vfs/vfs

Pull autofs fixes from Christian Brauner:
"This fixes a memory leak in autofs reported by syzkaller and a missing
  conversion from uninterruptible to interruptible wake up when autofs
  is in catatonic mode"

* tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  autofs: use wake_up() instead of wake_up_interruptible(()
  autofs: fix memory leak of waitqueues in autofs_catatonic_mode

Merge tag 'v6.6-vfs.fchmodat2' of git://git./linux/kernel/git/vfs/vfs

Pull fchmodat2 system call from Christian Brauner:
"This adds the fchmodat2() system call. It is a revised version of the
  fchmodat() system call, adding a missing flag argument. Support for
  both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH is included.

  Adding this system call revision has been a longstanding request but
  so far has always fallen through the cracks. While the kernel
  implementation of fchmodat() does not have a flag argument, the
  libc-provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
  have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
  (see [1] and [2]).

  The workaround is brittle because it relies not just on O_PATH and
  O_NOFOLLOW semantics and procfs magic links but also on our rather
  inconsistent symlink semantics.

  This gives userspace a proper fchmodat2() system call that libcs can
  use to properly implement fchmodat(3) and allows them to get rid of
  their hacks. In this case it will immediately benefit them as the
  current workaround is already defunct because of the aforementioned
  inconsistencies.

  In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
  AT_EMPTY_PATH with fchmodat2(). This is already possible with
  fchownat() so there's no reason to not also support it for
  fchmodat2().

  The implementation is simple and comes with selftests. Implementation
  of the system call and wiring up the system call are done as separate
  patches even though they could arguably be one patch. But in case
  there are merge conflicts from other system call additions it can be
  beneficial to have separate patches"

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]
* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: fchmodat2: remove duplicate unneeded defines
  fchmodat2: add support for AT_EMPTY_PATH
  selftests: Add fchmodat2 selftest
  arch: Register fchmodat2, usually as syscall 452
  fs: Add fchmodat2()
  Non-functional cleanup of a "__user * filename"
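
Until a libc wraps the new call, it can be reached through the generic syscall(2) entry point. The following is a minimal sketch under assumptions: it hard-codes syscall number 452 (the commit title above says "usually", so this does not hold on every architecture), it only works on Linux, and the `fchmodat2()` wrapper name is our own. On kernels without the call it raises OSError with ENOSYS.

```python
import ctypes
import errno
import os

# Assumption: 452 is the syscall number on this architecture.
SYS_FCHMODAT2 = 452
AT_FDCWD = -100
AT_SYMLINK_NOFOLLOW = 0x100
AT_EMPTY_PATH = 0x1000

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def fchmodat2(dirfd, path, mode, flags=0):
    """Invoke the raw fchmodat2() system call; raise OSError on failure."""
    res = _libc.syscall(
        ctypes.c_long(SYS_FCHMODAT2),
        ctypes.c_long(dirfd),
        ctypes.c_char_p(os.fsencode(path)),
        ctypes.c_long(mode),
        ctypes.c_long(flags),
    )
    if res == -1:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)


if __name__ == "__main__":
    try:
        # Change the mode of ./example without following a final symlink.
        fchmodat2(AT_FDCWD, "example", 0o640, AT_SYMLINK_NOFOLLOW)
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            print("fchmodat2() is not available on this kernel")
        else:
            print("fchmodat2() failed:", exc)
```

Passing AT_EMPTY_PATH with an empty path string would instead apply the mode change to the file referred to by `dirfd` itself.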

Merge tag 'v6.6-vfs.super' of git://git./linux/kernel/git/vfs/vfs

Pull superblock updates from Christian Brauner:
"This contains the super rework that was ready for this cycle. The
  first part changes the order of how we open block devices and allocate
  superblocks, contains various cleanups, simplifications, and a new
  mechanism to wait on superblock state changes.

  This unblocks work to ultimately limit the number of writers to a
  block device. Jan has already scheduled follow-up work that will be
  ready for v6.7 and allows us to restrict the number of writers to a
  given block device. That series builds on this work right here.

  The second part contains filesystem freezing updates.

  Overview:

  The generic superblock changes are roughly organized as follows
  (ignoring additional minor cleanups):

   (1) Removal of the bd_super member from struct block_device.

       This was a very odd back pointer to struct super_block with
       unclear rules. For all relevant places we have other means to get
       the same information so just get rid of this.

   (2) Simplify rules for superblock cleanup.

       Roughly, everything that is allocated during fs_context
       initialization and that's stored in fs_context->s_fs_info needs
       to be cleaned up by the fs_context->free() implementation before
       the superblock allocation function has been called successfully.

       After sget_fc() has returned, fs_context->s_fs_info has been
       transferred to sb->s_fs_info at which point sb->kill_sb() is
       fully responsible for cleanup. Adhering to these rules means that
       cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
       brittle and inconsistent.

       Cleanup shouldn't be duplicated between sb->put_super() and
       sb->kill_sb(), as sb->put_super() is only called if sb->s_root has
       been set, aka when the filesystem has been successfully born
       (SB_BORN). That complexity should be avoided.

       This also means that block devices are to be closed in
       sb->kill_sb() instead of sb->put_super(). More details in the
       lower section.

   (3) Make it possible to lookup or create a superblock before opening
       block devices

       There's a subtle dependency on (2) as some filesystems did rely
       on fill_super() to be called in order to correctly clean up
       sb->s_fs_info. All these filesystems have been fixed.

   (4) Switch most filesystems to follow the same logic as the generic
       mount code now does as outlined in (3).

   (5) Use the superblock as the holder of the block device. We can now
       easily go back from block device to owning superblock.

   (6) Export and extend the generic fs_holder_ops and use them as
       holder ops everywhere and remove the filesystem specific holder
       ops.

   (7) Call from the block layer up into the filesystem layer when the
       block device is removed, allowing the filesystem to be shut down
       without risk of deadlocks.

   (8) Get rid of get_super().

       We can now easily go back from the block device to owning
       superblock and can call up from the block layer into the
       filesystem layer when the device is removed. So no need to wade
       through all registered superblocks to find the owning superblock
       anymore"

Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/
* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
  super: use higher-level helper for {freeze,thaw}
  super: wait until we passed kill super
  super: wait for nascent superblocks
  super: make locking naming consistent
  super: use locking helpers
  fs: simplify invalidate_inodes
  fs: remove get_super
  block: call into the file system for ioctl BLKFLSBUF
  block: call into the file system for bdev_mark_dead
  block: consolidate __invalidate_device and fsync_bdev
  block: drop the "busy inodes on changed media" log message
  dasd: also call __invalidate_device when setting the device offline
  amiflop: don't call fsync_bdev in FDFMTBEG
  floppy: call disk_force_media_change when changing the format
  block: simplify the disk_force_media_change interface
  nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
  xfs use fs_holder_ops for the log and RT devices
  xfs: drop s_umount over opening the log and RT devices
  ext4: use fs_holder_ops for the log device
  ext4: drop s_umount over opening the log device
  ...

Merge tag 'v6.6-vfs.misc' of git://git./linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual filesystems.

  Features:

   - Block mode changes on symlinks and rectify our broken semantics

   - Report file modifications via fsnotify() for splice

   - Allow specifying an explicit timeout for the "rootwait" kernel
     command line option. This makes it possible to time out and reboot instead of
     always waiting indefinitely for the root device to show up

   - Use synchronous fput for the close system call

  Cleanups:

   - Get rid of open-coded lockdep workarounds for async io submitters
     and replace it all with a single consolidated helper

   - Simplify epoll allocation helper

   - Convert simple_write_begin and simple_write_end to use a folio

   - Convert page_cache_pipe_buf_confirm() to use a folio

   - Simplify __range_close to avoid pointless locking

   - Disable per-cpu buffer head cache for isolated cpus

   - Port ecryptfs to kmap_local_page() api

   - Remove redundant initialization of pointer buf in pipe code

   - Unexport the d_genocide() function which is only used within core
     vfs

   - Replace printk(KERN_ERR) and WARN_ON() with WARN()

  Fixes:

   - Fix various kernel-doc issues

   - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE

   - Fix a mainly theoretical issue in devpts

   - Check the return value of __getblk() in reiserfs

   - Fix a racy assert in i_readcount_dec

   - Fix integer conversion issues in various functions

   - Fix LSM security context handling during automounts that prevented
     NFS superblock sharing"

* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  cachefiles: use kiocb_{start,end}_write() helpers
  ovl: use kiocb_{start,end}_write() helpers
  aio: use kiocb_{start,end}_write() helpers
  io_uring: use kiocb_{start,end}_write() helpers
  fs: create kiocb_{start,end}_write() helpers
  fs: add kerneldoc to file_{start,end}_write() helpers
  io_uring: rename kiocb_end_write() local helper
  splice: Convert page_cache_pipe_buf_confirm() to use a folio
  libfs: Convert simple_write_begin and simple_write_end to use a folio
  fs/dcache: Replace printk and WARN_ON by WARN
  fs/pipe: remove redundant initialization of pointer buf
  fs: Fix kernel-doc warnings
  devpts: Fix kernel-doc warnings
  doc: idmappings: fix an error and rephrase a paragraph
  init: Add support for rootwait timeout parameter
  vfs: fix up the assert in i_readcount_dec
  fs: Fix one kernel-doc comment
  docs: filesystems: idmappings: clarify from where idmappings are taken
  fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
  vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
  ...

Merge tag 'v6.6-vfs.tmpfs' of git://git./linux/kernel/git/vfs/vfs

Pull libfs and tmpfs updates from Christian Brauner:
"This cycle saw a lot of work for tmpfs that required changes to the
  vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
  cycle. Things will go back to mm next cycle.

  Features
  ========

   - By far the biggest work is the quota support for tmpfs. New tmpfs
     quota infrastructure is added to support it and a new QFMT_SHMEM
     uapi option is exposed.

     This offers user and group quotas to tmpfs (project quotas will be
     added later). Similar to other filesystems tmpfs quotas are not
     supported within user namespaces yet.

   - Add support for user xattrs. While tmpfs has long supported security
     xattrs (security.*) and POSIX ACLs, it lacked
     support for user xattrs (user.*). With this pull request tmpfs will
     be able to support a limited number of user xattrs.

     This is accompanied by a fix (see below) to limit persistent simple
     xattr allocations.

   - Add support for stable directory offsets. Currently tmpfs relies on
     the libfs provided cursor-based mechanism for readdir. This causes
     issues when a tmpfs filesystem is exported via NFS.

     NFS clients do not open directories. Instead, each server-side
     readdir operation opens the directory, reads it, and then closes
     it. Since the cursor state for that directory is associated with
     the opened file it is discarded after each readdir operation. Such
     directory offsets are not just cached by NFS clients but also
     various userspace libraries based on these clients.

     As it stands there is no way to invalidate the caches when
     directory offsets have changed and the whole application depends on
     unchanging directory offsets.

     At LSFMM we discussed how to solve this problem and decided to
     support stable directory offsets. libfs now allows filesystems like
     tmpfs to use an xarray to map a directory offset to a dentry. This
     mechanism is currently only used by tmpfs but can be supported by
     others as well.
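
     The idea of stable directory offsets can be illustrated with a small
     model. This is a sketch under assumptions: a Python dict stands in for
     the xarray, the class `OffsetDir` and its methods are hypothetical, and
     starting the offsets at 2 is a choice made for the illustration.

```python
class OffsetDir:
    """Toy directory in which every entry keeps a fixed offset for life."""

    def __init__(self):
        self._by_offset = {}   # offset -> entry name (stand-in for an xarray)
        self._next = 2         # assumed: low offsets reserved for "." and ".."

    def add(self, name):
        """Create an entry and return the offset permanently assigned to it."""
        offset = self._next
        self._next += 1
        self._by_offset[offset] = name
        return offset

    def remove(self, offset):
        """Delete an entry; the offsets of all other entries do not move."""
        del self._by_offset[offset]

    def readdir(self, pos, count):
        """Return up to count (offset, name) pairs with offset >= pos.

        No per-open cursor is needed: the caller resumes from the offset
        that follows the last entry it saw.
        """
        offsets = sorted(o for o in self._by_offset if o >= pos)[:count]
        return [(o, self._by_offset[o]) for o in offsets]


d = OffsetDir()
for name in ("a", "b", "c"):
    d.add(name)
first = d.readdir(0, 2)              # first readdir call
d.remove(first[0][0])                # "a" is unlinked in between
print(d.readdir(first[-1][0] + 1, 2))  # resume after "b"
```

     Because the resume position is an offset stored with the entry rather
     than state attached to an open file, a server that reopens the
     directory for every readdir call still continues at the right place.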

  Fixes
  =====

   - Change persistent simple xattrs allocations in libfs from
     GFP_KERNEL to GFP_KERNEL_ACCOUNT so they're subject to memory
     cgroup limits. Since this is a change to libfs it affects both
     tmpfs and kernfs.

   - Correctly verify {g,u}id mount options.

     A new filesystem context is created via fsopen() which records the
     namespace that becomes the owning namespace of the superblock when
     fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
     mountable in namespaces. However, fsconfig() calls can occur in a
     namespace different from the namespace where fsopen() has been
     called.

     Currently, when fsconfig() is called to set {g,u}id mount options
     the requested {g,u}id is mapped into a k{g,u}id according to the
     namespace where fsconfig() was called from. The resulting k{g,u}id
     is not guaranteed to be resolvable in the namespace of the
     filesystem (the one that fsopen() was called in).

     This means it's possible for an unprivileged user to create files
     owned by any group in a tmpfs mount since it's possible to set the
     setid bits on the tmpfs directory.

     The contract for {g,u}id mount options and {g,u}id values in
     general set from userspace has always been that they are translated
     according to the caller's idmapping. In that respect, tmpfs has been
     doing the correct thing. But since tmpfs is mountable in
     unprivileged contexts it is also necessary to verify that the
     resulting k{g,u}id is representable in the namespace of the
     superblock to avoid such bugs.

     The new mount api's cross-namespace delegation abilities are
     already widely used. Having talked to a bunch of userspace this is
     the most faithful solution with minimal regression risks"
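
The two-step check described above (translate according to the caller's idmapping, then verify the result against the superblock's namespace) can be modelled as follows. This is a sketch under assumptions: `UserNS`, `parse_uid_option()`, and the example ID ranges are hypothetical and only mimic the shape of the logic.

```python
INVALID_KUID = -1


class UserNS:
    """Toy user namespace: a list of (first_id_in_ns, first_kernel_id, count)."""

    def __init__(self, extents):
        self.extents = extents

    def to_kernel(self, uid):
        """Translate an ID as seen inside this namespace into a kernel ID."""
        for ns_first, k_first, count in self.extents:
            if ns_first <= uid < ns_first + count:
                return k_first + (uid - ns_first)
        return INVALID_KUID

    def has_mapping(self, kuid):
        """True if the kernel ID can be represented inside this namespace."""
        return any(k_first <= kuid < k_first + count
                   for _, k_first, count in self.extents)


def parse_uid_option(uid, caller_ns, sb_ns):
    """Map uid via the caller's namespace, then verify it for the superblock."""
    kuid = caller_ns.to_kernel(uid)
    if kuid == INVALID_KUID:
        raise ValueError("uid has no mapping in the caller's namespace")
    if not sb_ns.has_mapping(kuid):
        raise ValueError("uid is not representable in the superblock's namespace")
    return kuid


init_ns = UserNS([(0, 0, 2**32 - 1)])
container_ns = UserNS([(0, 100000, 65536)])

# fsopen() happened in the container, fsconfig() is called from init_ns.
print(parse_uid_option(100000, caller_ns=init_ns, sb_ns=container_ns))
```

In this model, a request for uid 0 from the initial namespace would be rejected, since kernel ID 0 has no mapping inside the container's namespace.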

* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
  mm: invalidation check mapping before folio_contains
  tmpfs: trivial support for direct IO
  tmpfs,xattr: enable limited user extended attributes
  tmpfs: track free_ispace instead of free_inodes
  xattr: simple_xattr_set() return old_xattr to be freed
  tmpfs: verify {g,u}id mount options correctly
  shmem: move spinlock into shmem_recalc_inode() to fix quota support
  libfs: Remove parent dentry locking in offset_iterate_dir()
  libfs: Add a lock class for the offset map's xa_lock
  shmem: stable directory offsets
  shmem: Refactor shmem_symlink()
  libfs: Add directory operations for stable offsets
  shmem: fix quota lock nesting in huge hole handling
  shmem: Add default quota limit mount options
  shmem: quota support
  shmem: prepare shmem quota infrastructure
  quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks
  shmem: make shmem_get_inode() return ERR_PTR instead of NULL
  shmem: make shmem_inode_acct_block() return error

Merge tag 'v6.6-vfs.ctime' of git://git./linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change, the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesystems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly renamed to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter from the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
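
The query-flag mechanism from the pull message can be sketched as a small model. Assumptions are marked in the code: the flag is taken to be bit index 30 (reading "31st bit" as counted from one), the clocks are injected callables returning (sec, nsec), and `Inode`, `query_ctime()`, and `update_ctime()` are hypothetical names, not the kernel's accessors.

```python
# Assumption: "31st bit" counted from one, i.e. bit index 30. Nanosecond
# values stay below 10**9 < 2**30, so they never collide with the flag.
I_CTIME_QUERIED = 1 << 30
NSEC_MASK = I_CTIME_QUERIED - 1


class Inode:
    """Toy inode whose ctime nanoseconds field doubles as a query flag."""

    def __init__(self, coarse_clock, fine_clock):
        self._coarse = coarse_clock
        self._fine = fine_clock
        self._sec = 0
        self._nsec = 0          # may carry I_CTIME_QUERIED

    def query_ctime(self):
        """Report the ctime and record that somebody has looked at it."""
        self._nsec |= I_CTIME_QUERIED
        return (self._sec, self._nsec & NSEC_MASK)

    def update_ctime(self):
        """Stamp the inode: fine-grained only if the old value was queried."""
        queried = self._nsec & I_CTIME_QUERIED
        clock = self._fine if queried else self._coarse
        # Storing a fresh, normalized value also clears the flag.
        self._sec, self._nsec = clock()
        return (self._sec, self._nsec)


coarse = iter([(100, 0), (100, 8_000_000)])
fine = iter([(100, 4_000_123)])
inode = Inode(lambda: next(coarse), lambda: next(fine))

inode.update_ctime()         # nobody has queried yet: coarse timestamp
inode.query_ctime()          # e.g. a stat() call from an NFS server
print(inode.update_ctime())  # queried since last update: fine timestamp
print(inode.update_ctime())  # not queried again: back to coarse
```

Only updates that follow a query pay for the fine-grained clock, so a file under heavy writes that nobody is watching still gets the cheap coarse timestamps.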

parisc: ccio-dma: Create private runway procfs root entry

Create a separate procfs "runway" root entry for the CCIO driver.
No need to share it with the sba_iommu driver, as only one
of those busses can be active in one machine anyway.

Signed-off-by: Helge Deller <deller@gmx.de>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 547259580dfa ("parisc: Move proc_mckinley_root and proc_runway_root to sba_iommu")
Cc: <stable@vger.kernel.org> # v6.5

Merge tag 'v6.6-vfs.fs_context' of git://git./linux/kernel/git/vfs/vfs

Pull mount API updates from Christian Brauner:
"This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to
  implement something like

      $ mount -t ext4 --exclusive /dev/sda /B

  which fails if a superblock for the requested filesystem already
  exists instead of silently reusing an existing superblock.

  Without it, in the sequence

      $ move-mount -f xfs -o       source=/dev/sda4 /A
      $ move-mount -f xfs -o noacl,source=/dev/sda4 /B

  the initial mounter will create a superblock. The second mounter will
  reuse the existing superblock, creating a bind-mount (see [1] for the
  source of the move-mount binary).

  The problem is that reusing an existing superblock means all mount
  options other than read-only and read-write will be silently ignored
  even if they are incompatible requests. For example, the second mount
  has requested no POSIX ACL support but since the existing superblock
  is reused POSIX ACL support will remain enabled.

  Such silent superblock reuse can easily become a security issue.

  After adding support for FSCONFIG_CMD_CREATE_EXCL to mount(8) in
  util-linux this can be fixed:

      $ move-mount -f xfs --exclusive -o       source=/dev/sda4 /A
      $ move-mount -f xfs --exclusive -o noacl,source=/dev/sda4 /B
      Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed

  This requires the new mount api. With the old mount api it would be
  necessary to plumb this through every legacy filesystem's
  file_system_type->mount() method. If they want this feature they are
  most welcome to switch to the new mount api"

Link: https://github.com/brauner/move-mount-beneath [1]
Link: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
Link: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
Link: https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner
Link: https://lore.kernel.org/lkml/20230824-anzog-allheilmittel-e8c63e429a79@brauner/
* tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: add FSCONFIG_CMD_CREATE_EXCL
  fs: add vfs_cmd_reconfigure()
  fs: add vfs_cmd_create()
  super: remove get_tree_single_reconf()
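
The difference between silent superblock reuse and exclusive creation can be captured in a small model. This is a sketch under assumptions: `SuperblockTable` and its `create()` method are hypothetical, superblocks are keyed only by filesystem type and source, and EBUSY is inferred from the "Device or resource busy" message quoted above.

```python
import errno


class SuperblockTable:
    """Toy registry of superblocks keyed by (fstype, source)."""

    def __init__(self):
        self._sbs = {}

    def create(self, fstype, source, options=(), exclusive=False):
        """Return a superblock for the request.

        Without exclusive, an existing superblock is reused and the newly
        requested options are silently ignored. With exclusive, reuse is
        refused with EBUSY.
        """
        key = (fstype, source)
        existing = self._sbs.get(key)
        if existing is not None:
            if exclusive:
                raise OSError(errno.EBUSY,
                              "reusing existing filesystem not allowed")
            return existing
        sb = {"fstype": fstype, "source": source, "options": frozenset(options)}
        self._sbs[key] = sb
        return sb


table = SuperblockTable()
first = table.create("xfs", "/dev/sda4")
second = table.create("xfs", "/dev/sda4", options=["noacl"])
print("noacl" in second["options"])   # the request for noacl was dropped

try:
    table.create("xfs", "/dev/sda4", options=["noacl"], exclusive=True)
except OSError as exc:
    print(exc.strerror)
```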

parisc: chassis: Do not overwrite string on LCD display

If we send a chassis code via PDC, PDC usually overwrites the
contents on the LCD display. Just call lcd_print() in this case
so that the LCD/LED driver prints the last string again.

Signed-off-by: Helge Deller <deller@gmx.de>

parisc: led: Rewrite LED/LCD driver to utilizize Linux LED subsystem

Rewrite the whole driver and drop the driver's own code for calculating
load average, disk and LAN load. Switch instead to the in-kernel LED
subsystem, which brings several advantages, e.g.
- existing triggers for heartbeat and disk/LAN activity can be used
- users can configure the LEDs at will to any existing trigger via
/sys/class/leds
- less overhead since we don't need to run our own timers
- fully integrated in Linux and as such cleaner code.

Note that the driver now depends on CONFIG_LEDS_CLASS, which has to
be built in and not built as a module.

Signed-off-by: Helge Deller <deller@gmx.de>

dt-bindings: thermal: lmh: update maintainer address

The old email is no longer functioning.

Fixes: 17b1362d4919 ("MAINTAINERS: Update email address")
Signed-off-by: David Heidelberg <david@ixit.cz>
Link: https://lore.kernel.org/r/20230823223622.91789-1-david@ixit.cz
Signed-off-by: Rob Herring <robh@kernel.org>

of: unittest: Fix of_unittest_pci_node() kconfig dependencies

of_unittest_pci_node test depends on both CONFIG_PCI_DYNAMIC_OF_NODES
and CONFIG_OF_OVERLAY. Move the test into the existing
CONFIG_OF_OVERLAY ifdef and rework the CONFIG_PCI_DYNAMIC_OF_NODES
dependency to use IS_ENABLED() instead. This reduces the combinations to
build.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308241954.oRNfVqmB-lkp@intel.com/
Fixes: 26409dd04589 ("of: unittest: Add pci_dt_testdrv pci driver")
Cc: Lizhi Hou <lizhi.hou@amd.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230824221743.1581707-1-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>

Merge branch 'devlink-finish-file-split-and-get-retire-leftover-c'

Jiri Pirko says:

====================
devlink: finish file split and get retire leftover.c

This patchset finishes a move Jakub started and Moshe continued in the
past. I was planning to do this for a long time, so here it is, finally.

This patchset does not change any behaviour. It just splits leftover.c
into per-object files and makes the necessary changes, like declaring functions
used from other code, on the way.

The last 3 patches are pushing the rest of the code into appropriate
existing files.
====================

Link: https://lore.kernel.org/r/20230828061657.300667-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: move devlink_notify_register/unregister() to dev.c

Finally, move the last bits out of leftover.c,
the devlink_notify_register/unregister() functions, to dev.c

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-16-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: move small_ops definition into netlink.c

Move the generic netlink small_ops definition to where it is consumed,
in netlink.c

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-15-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: move tracepoint definitions into core.c

Move the remaining tracepoint definitions to the most suitable file, core.c.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push linecard related code into separate file

Cut out another chunk from leftover.c and put linecard related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push rate related code into separate file

Cut out another chunk from leftover.c and put rate related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push trap related code into separate file

Cut out another chunk from leftover.c and put trap related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: use tracepoint_enabled() helper

In preparation for the trap code move, use tracepoint_enabled() helper
instead of trace_devlink_trap_report_enabled() which would not be
defined in that scope.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push region related code into separate file

Cut out another chunk from leftover.c and put region related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push param related code into separate file

Cut out another chunk from leftover.c and put param related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push resource related code into separate file

Cut out another chunk from leftover.c and put resource related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push dpipe related code into separate file

Cut out another chunk from leftover.c and put dpipe related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper

Since both dpipe and resource code is using this helper, in preparation
for code split to separate files, move
devlink_dpipe_send_and_alloc_skb() helper into netlink.c. Rename it on
the way.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push shared buffer related code into separate file

Cut out another chunk from leftover.c and put sb related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push port related code into separate file

Cut out another chunk from leftover.c and put port related code
into a separate file.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: push object register/unregister notifications into separate helpers

In preparation for splitting leftover.c into individual files, avoid the
need to expose object structures in devl_internal.h and allow them to be
maintained in the object files.

The register/unregister notifications need to know the structures
to iterate lists. To avoid the need, introduce per-object
register/unregister notification helpers and use them.
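
The idea of per-object helpers can be sketched in plain C as follows. This is an assumption-laden illustration, not the devlink code: struct demo_port, struct demo_devlink and both functions are hypothetical names, and a single file cannot really show the header split.

```c
#include <stddef.h>

/* Object layout that, after the split, only the port code needs to know. */
struct demo_port {
	int index;
	int notified;
	struct demo_port *next;
};

struct demo_devlink {
	struct demo_port *ports;
};

/* Per-object helper: the only function that walks the port list. */
int demo_ports_notify_register(struct demo_devlink *dl)
{
	int count = 0;

	for (struct demo_port *p = dl->ports; p != NULL; p = p->next) {
		p->notified = 1;	/* stand-in for sending a notification */
		count++;
	}
	return count;
}

/* Core code: delegates to helpers and never looks inside an object. */
int demo_notify_register(struct demo_devlink *dl)
{
	return demo_ports_notify_register(dl);
}
```

Because the core function only calls the helper, the struct demo_port definition could move into the port source file without the core code noticing.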

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'asoc-fix-v6.5-merge-window' of https://git./linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes that got left after v6.4

These were some changes in my v6.4 branch that never got sent as fixes,
none of them super urgent thankfully.

ASoC: dwc: i2s: Fix unused functions

A few newly added functions aren't built unless CONFIG_OF is set,
which results in a build failure due to defined-but-not-used errors.

Put "#ifdef CONFIG_OF" around those functions to suppress the build
error.
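
A minimal sketch of the guard pattern, assuming a hypothetical DEMO_CONFIG_OF symbol and invented function names (this is not the dwc i2s code): a static helper whose only caller is conditionally compiled needs the same guard, otherwise it is defined but never used.

```c
/* Hypothetical stand-in for the Kconfig symbol; delete to test the other path. */
#define DEMO_CONFIG_OF 1

#ifdef DEMO_CONFIG_OF
/* Only referenced from the guarded call below, so it carries the same guard. */
static int demo_parse_dt_clock(void)
{
	return 12288000;	/* illustrative clock rate in Hz */
}
#endif

int demo_probe(void)
{
#ifdef DEMO_CONFIG_OF
	return demo_parse_dt_clock();
#else
	return 0;		/* no device tree support: nothing to parse */
#endif
}
```

If the guard around demo_parse_dt_clock() were dropped while the symbol is undefined, a build with unused-function warnings treated as errors would fail, which is the situation the commit describes.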

Fixes: 52ea7c0543f8 ("ASoC: dwc: i2s: Add StarFive JH7110 SoC support")
Link: https://lore.kernel.org/r/20230828113537.27600-1-tiwai@suse.de
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge tag 'asoc-v6.6' of https://git./linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v6.6

The rest of the updates for v6.6, some of the highlights include:

- A big API cleanup from Morimoto-san, rationalising the places we put
   functions.
- Lots of work on the SOF framework, AMD and Intel drivers, including a
   lot of cleanup and new device support.
- Standardisation of the presentation of jacks from drivers.
- Provision of some generic sound card DT properties.
- Conversion of more drivers to the maple tree register cache.
- New drivers for AMD Van Gogh, AWInic AW88261, Cirrus Logic cs42l43,
   various Intel platforms, Mediatek MT7986, RealTek RT1017 and StarFive
   JH7110.

ALSA: usb-audio: Don't try to submit URBs after disconnection

USB-audio driver can still submit URBs while the device is being
disconnected, and it may result in spurious error messages like:
  usb 1-2: cannot submit urb (err = -19)
  usb 1-2: Unable to submit urb #0: -19 at snd_usb_queue_pending_output_urbs
  usb 1-2: cannot submit urb 0, error -19: no device
Although those are harmless, they are just ugly.

This patch tries to avoid spewing such error messages when the device
is already in the disconnected state.  It also skips the superfluous
xfer notification.
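
The early-exit idea can be sketched as below. The types and functions are hypothetical stand-ins rather than the usb-audio driver's API; the only detail taken from the log above is that the failure code is -19, which is -ENODEV.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_chip {
	bool disconnected;	/* set once the device is being torn down */
	int error_logs;		/* how many error messages were emitted */
	int submitted;		/* how many URBs were handed to the USB core */
};

/* Stand-in for the USB core: submission fails once the device is gone. */
static int demo_usb_submit(const struct demo_chip *chip)
{
	return chip->disconnected ? -ENODEV : 0;
}

int demo_queue_urb(struct demo_chip *chip)
{
	int err;

	/* Check the state first, so a known-dead device produces no noise. */
	if (chip->disconnected)
		return -ENODEV;

	err = demo_usb_submit(chip);
	if (err < 0) {
		chip->error_logs++;	/* stand-in for the error print */
		return err;
	}
	chip->submitted++;
	return 0;
}
```

The caller still sees -ENODEV, but the log stays quiet because the failing submission is never attempted.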

Link: https://lore.kernel.org/r/20230828101924.27107-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge tag 'opp-updates-6.6' of git://git./linux/kernel/git/vireshk/pm

Pull OPP updates for 6.6 from Viresh Kumar:

"- Minor core cleanup and addition of new frequency related APIs (Viresh
   Kumar and Manivannan Sadhasivam).

- Convert ti cpufreq/opp bindings to json schema (Nishanth Menon)."

* tag 'opp-updates-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  dt-bindings: cpufreq: Convert ti-cpufreq to json schema
  dt-bindings: opp: Convert ti-omap5-opp-supply to json schema
  OPP: Fix argument name in doc comment
  dt-bindings: opp: Increase maxItems for opp-hz property
  OPP: Fix passing 0 to PTR_ERR in _opp_attach_genpd()
  OPP: Fix potential null ptr dereference in dev_pm_opp_get_required_pstate()
  OPP: Reuse dev_pm_opp_get_freq_indexed()
  OPP: Update _read_freq() to return the correct frequency
  OPP: Add dev_pm_opp_find_freq_exact_indexed()
  OPP: Introduce dev_pm_opp_get_freq_indexed() API
  OPP: Introduce dev_pm_opp_find_freq_{ceil/floor}_indexed() APIs
  OPP: Rearrange entries in pm_opp.h

Merge branch 'pm-cpufreq'

Merge ARM cpufreq updates for 6.6:

- Migrate various platforms to use remove callback returning void
   (Yangtao Li).

- Add online/offline/exit hooks for Tegra driver (Sumit Gupta).

- Explicitly include correct DT includes (Rob Herring).

- Frequency domain updates for qcom-hw driver (Neil Armstrong).

- Modify AMD pstate driver to return the highest_perf value (Meng Li).

- Generic cleanups for cppc, mediatek and powernow driver (Liao Chang,
   Konrad Dybcio).

- Add more platforms to cpufreq-arm driver's blocklist (AngeloGioacchino
   Del Regno, Konrad Dybcio).

- brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva).

* pm-cpufreq: (33 commits)
  cpufreq: tegra194: remove opp table in exit hook
  cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()
  cpufreq: tegra194: add online/offline hooks
  cpufreq: qcom-cpufreq-hw: add support for 4 freq domains
  dt-bindings: cpufreq: qcom-hw: add a 4th frequency domain
  cpufreq: cppc: Set fie_disabled to FIE_DISABLED if fails to create kworker_fie
  cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.
  cpufreq: Prefer to print cpuid in MIN/MAX QoS register error message
  cpufreq: amd-pstate-ut: Modify the function to get the highest_perf value
  cpufreq: mediatek-hw: Remove unused define
  cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
  cpufreq: brcmstb-avs-cpufreq: Fix -Warray-bounds bug
  cpufreq: blocklist MSM8998 in cpufreq-dt-platdev
  cpufreq: omap: Convert to platform remove callback returning void
  cpufreq: qoriq: Convert to platform remove callback returning void
  cpufreq: acpi: Convert to platform remove callback returning void
  cpufreq: tegra186: Convert to platform remove callback returning void
  cpufreq: qcom-nvmem: Convert to platform remove callback returning void
  cpufreq: kirkwood: Convert to platform remove callback returning void
  cpufreq: pcc-cpufreq: Convert to platform remove callback returning void
  ...

Merge tag 'cpufreq-arm-updates-6.6' of git://git./linux/kernel/git/vireshk/pm

Pull ARM cpufreq updates for 6.6 from Viresh Kumar:

"- Migrate various platforms to use remove callback returning void
   (Yangtao Li).

- Add online/offline/exit hooks for Tegra driver (Sumit Gupta).

- Explicitly include correct DT includes (Rob Herring).

- Frequency domain updates for qcom-hw driver (Neil Armstrong).

- Modify AMD pstate driver to return the highest_perf value (Meng Li).

- Generic cleanups for cppc, mediatek and powernow driver (Liao Chang
   and Konrad Dybcio).

- Add more platforms to cpufreq-arm driver's blocklist (AngeloGioacchino
   Del Regno and Konrad Dybcio).

- brcmstb-avs-cpufreq: Fix -Warray-bounds bug (Gustavo A. R. Silva)."

* tag 'cpufreq-arm-updates-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: (33 commits)
  cpufreq: tegra194: remove opp table in exit hook
  cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()
  cpufreq: tegra194: add online/offline hooks
  cpufreq: qcom-cpufreq-hw: add support for 4 freq domains
  dt-bindings: cpufreq: qcom-hw: add a 4th frequency domain
  cpufreq: cppc: Set fie_disabled to FIE_DISABLED if fails to create kworker_fie
  cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.
  cpufreq: Prefer to print cpuid in MIN/MAX QoS register error message
  cpufreq: amd-pstate-ut: Modify the function to get the highest_perf value
  cpufreq: mediatek-hw: Remove unused define
  cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
  cpufreq: brcmstb-avs-cpufreq: Fix -Warray-bounds bug
  cpufreq: blocklist MSM8998 in cpufreq-dt-platdev
  cpufreq: omap: Convert to platform remove callback returning void
  cpufreq: qoriq: Convert to platform remove callback returning void
  cpufreq: acpi: Convert to platform remove callback returning void
  cpufreq: tegra186: Convert to platform remove callback returning void
  cpufreq: qcom-nvmem: Convert to platform remove callback returning void
  cpufreq: kirkwood: Convert to platform remove callback returning void
  cpufreq: pcc-cpufreq: Convert to platform remove callback returning void
  ...

Merge remote-tracking branch 'linux-efi/urgent' into efi/next

cpufreq: tegra194: remove opp table in exit hook

Add exit hook and remove OPP table when the device gets unregistered.
This will fix the error messages when the CPU FREQ driver module is
removed and then re-inserted. It also fixes these messages while
onlining the first CPU from a policy whose CPUs were all previously
offlined.

debugfs: File 'cpu5' in directory 'opp' already present!
debugfs: File 'cpu6' in directory 'opp' already present!
debugfs: File 'cpu7' in directory 'opp' already present!

Fixes: f41e1442ac5b ("cpufreq: tegra194: add OPP support and set bandwidth")
Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
[ Viresh: Dropped irrelevant change from it ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

Merge branch 'for-next' into for-linus

Pull materials for 6.5 merge window.

Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge tag 'irqchip-6.6' of git://git./linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

  - Fix for Loongson eiointc init error handling

  - Fix a bunch of warnings showing up when -Wmissing-prototypes is set

  - A set of fixes for drivers checking for 0 as a potential return
    value from platform_get_irq()

  - Another set of patches converting existing code to the use of helpers
    such as of_address_count() and devm_platform_get_and_ioremap_resource()

  - A tree-wide cleanup of drivers including of_*.h without discrimination

  - Added support for the Amlogic C3 SoCs

Link: https://lore.kernel.org/lkml/20230828091543.4001857-1-maz@kernel.org

inet: fix IP_TRANSPARENT error handling

My recent patch forgot to change error handling for IP_TRANSPARENT
socket option.

WARNING: bad unlock balance detected!
6.5.0-rc7-syzkaller-01717-g59da9885767a #0 Not tainted
-------------------------------------
syz-executor151/5028 is trying to release lock (sk_lock-AF_INET) at:
[<ffffffff88213983>] sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
but there are no more locks to release!

other info that might help us debug this:
1 lock held by syz-executor151/5028:

stack backtrace:
CPU: 0 PID: 5028 Comm: syz-executor151 Not tainted 6.5.0-rc7-syzkaller-01717-g59da9885767a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
__lock_release kernel/locking/lockdep.c:5438 [inline]
lock_release+0x4b5/0x680 kernel/locking/lockdep.c:5781
sock_release_ownership include/net/sock.h:1824 [inline]
release_sock+0x175/0x1b0 net/core/sock.c:3527
sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
do_ip_setsockopt+0x12c1/0x3640 net/ipv4/ip_sockglue.c:1364
ip_setsockopt+0x59/0xe0 net/ipv4/ip_sockglue.c:1419
raw_setsockopt+0x218/0x290 net/ipv4/raw.c:833
__sys_setsockopt+0x2cd/0x5b0 net/socket.c:2305
__do_sys_setsockopt net/socket.c:2316 [inline]
__se_sys_setsockopt net/socket.c:2313 [inline]
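
The class of bug behind a "bad unlock balance" report can be sketched as follows. This is a simplified model with hypothetical names, not the ip_sockglue.c code: an option that is handled before the lock is taken must leave through a plain return, never through an exit path that releases the lock.

```c
#include <errno.h>

struct demo_sock {
	int lock_depth;		/* 0 = unlocked; negative = unbalanced unlock */
	int transparent;
};

enum { DEMO_OPT_TRANSPARENT = 1, DEMO_OPT_LOCKED = 2 };

static void demo_lock(struct demo_sock *sk)    { sk->lock_depth++; }
static void demo_release(struct demo_sock *sk) { sk->lock_depth--; }

int demo_setsockopt(struct demo_sock *sk, int opt, int val, int privileged)
{
	int err = 0;

	/* Lockless options: handled before the lock is taken. */
	switch (opt) {
	case DEMO_OPT_TRANSPARENT:
		if (val && !privileged)
			return -EPERM;	/* plain return, nothing to unlock */
		sk->transparent = !!val;
		return 0;
	}

	demo_lock(sk);
	switch (opt) {
	case DEMO_OPT_LOCKED:
		break;
	default:
		err = -ENOPROTOOPT;
		break;
	}
	demo_release(sk);	/* only reached with the lock held */
	return err;
}
```

In this model lock_depth is back at zero after every call, including the failing lockless one.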

Fixes: 4bd0623f04ee ("inet: move inet->transparent to inet->inet_flags")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: bonding: create directly devices in the target namespaces

If setting link1_1 to netns client fails, we should delete link1_1 in the
cleanup path. But if link1_1 is set to netns client successfully, deleting
link1_1 will report a warning. So it is safer to create the devices
directly in the target namespaces.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Closes: https://lore.kernel.org/all/ZNyJx1HtXaUzOkNA@Laptop-X1/
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: fix ASPM-related issues on a number of systems with NIC version from RTL8168h

This effectively reverts 4b5f82f6aaef. On a number of systems ASPM L1
causes tx timeouts with RTL8168h, see referenced bug report.

Fixes: 4b5f82f6aaef ("r8169: enable ASPM L1/L1.1 from RTL8168h")
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217814
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: tg3: remove unreachable code

'tp->irq_max' value is either 1 [L16336] or 5 [L16354], as indicated in
tg3_get_invariants(). Therefore, 'i' can't exceed 4 in tg3_init_one(),
which makes (i <= 4) always true. Moreover, the 'intmbx' value set at the
last iteration is not used later in its scope.
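
The reasoning can be checked with a small standalone model of the loop. The function name and the two stride values are illustrative assumptions; only the bounds on irq_max and the (i <= 4) test come from the message above.

```c
/* Returns how many loop iterations reach the "else" branch. */
int demo_count_else_branch(int irq_max)
{
	unsigned int intmbx = 0;
	int taken = 0;

	for (int i = 0; i < irq_max; i++) {
		if (i <= 4) {
			intmbx += 0x8;	/* illustrative stride */
		} else {
			intmbx += 0x4;	/* illustrative stride */
			taken++;
		}
	}
	(void)intmbx;	/* the final value is not used, as the message notes */
	return taken;
}
```

For both possible values of irq_max, 1 and 5, the loop index stops at 4 at most, so the else branch is never counted.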

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 78f90dcf184b ("tg3: Move napi_add calls below tg3_get_invariants")
Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make consumed action consistent in sch_handle_egress

While looking at TC_ACT_* handling, the TC_ACT_CONSUMED is only handled in
sch_handle_ingress but not sch_handle_egress. This was added via cd11b164073b
("net/tc: introduce TC_ACT_REINSERT.") and e5cf1baf92cb ("act_mirred: use
TC_ACT_REINSERT when possible") and later got renamed into TC_ACT_CONSUMED
via 720f22fed81b ("net: sched: refactor reinsert action").

The initial work was targeted for ovs back then and only needed on ingress,
and the mirred action module also restricts it to only that. However, given
it's an API contract it would still make sense to make this consistent with
sch_handle_ingress and handle it on egress side in the same way, that is,
setting return code to "success" and returning NULL back to the caller as
otherwise an action module sitting on egress returning TC_ACT_CONSUMED could
lead to a UAF when untreated.
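
The contract described here can be modelled in a few lines. The enum values, struct and function below are hypothetical and only mirror the shape of the handling: a consumed packet is reported as success and NULL is returned, so the caller cannot touch memory it no longer owns.

```c
#include <stddef.h>

enum demo_verdict { DEMO_ACT_OK, DEMO_ACT_SHOT, DEMO_ACT_CONSUMED };
enum { DEMO_XMIT_SUCCESS = 0, DEMO_XMIT_DROP = 1 };

struct demo_pkt {
	int freed;	/* set when this handler drops the packet */
};

struct demo_pkt *demo_handle_egress(struct demo_pkt *pkt,
				    enum demo_verdict verdict, int *ret)
{
	switch (verdict) {
	case DEMO_ACT_SHOT:
		pkt->freed = 1;		/* stand-in for freeing a dropped packet */
		*ret = DEMO_XMIT_DROP;
		return NULL;
	case DEMO_ACT_CONSUMED:
		/* The action already owns the packet: do not touch it again. */
		*ret = DEMO_XMIT_SUCCESS;
		return NULL;
	default:
		return pkt;		/* continue down the transmit path */
	}
}
```

Returning the pointer in the consumed case would let the transmit path keep using a packet that the action may already have released.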

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix skb consume leak in sch_handle_egress

Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:

  [...]
  unreferenced object 0xffff88818bcb4f00 (size 232):
  comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff  ..pa.....A1.....
  backtrace:
    [<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
    [<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
    [<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
    [<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
    [<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
    [<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
    [<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
    [<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
    [<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
    [<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
    [<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
    [<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
    [<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
    [<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
    [<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
    [<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
  [...]

I was able to reproduce this via:

  ip link add dev dummy0 type dummy
  ip link set dev dummy0 up
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
  ping 1.1.1.1
  <stolen>

After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
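
The two-buffer situation can be sketched with a tiny reference-counted model. Everything here is a hypothetical stand-in (demo_buf for an skb, demo_redirect() for a mirred-like action working on a clone); it only illustrates why the original buffer needs its own release when the verdict is "stolen".

```c
#include <stdlib.h>

/* Number of buffers currently allocated; a leak leaves this above zero. */
static int demo_live_bufs;

struct demo_buf {
	int users;	/* reference count */
};

struct demo_buf *demo_alloc(void)
{
	struct demo_buf *b = malloc(sizeof(*b));

	if (b) {
		b->users = 1;
		demo_live_bufs++;
	}
	return b;
}

/* Drop one reference and free the buffer if it was the last one. */
void demo_consume(struct demo_buf *b)
{
	if (b && --b->users == 0) {
		free(b);
		demo_live_bufs--;
	}
}

enum demo_act { DEMO_ACT_OK, DEMO_ACT_STOLEN };

/* Action that redirects a clone and reports the original as stolen. */
enum demo_act demo_redirect(const struct demo_buf *orig)
{
	struct demo_buf *clone = demo_alloc();	/* stand-in for a clone */

	(void)orig;
	demo_consume(clone);	/* stand-in for the target device's transmit */
	return DEMO_ACT_STOLEN;
}

struct demo_buf *demo_egress(struct demo_buf *b)
{
	if (demo_redirect(b) == DEMO_ACT_STOLEN) {
		demo_consume(b);	/* without this, the original leaks */
		return NULL;
	}
	return b;
}

int demo_live(void)
{
	return demo_live_bufs;
}
```

Removing the demo_consume(b) call in demo_egress() leaves one buffer allocated after every stolen packet, which is the pattern a leak detector would report.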

Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dccp: Fix out of bounds access in DCCP error handler

There was a previous attempt to fix an out-of-bounds access in the DCCP
error handlers, but that fix assumed that the error handlers only want
to access the first 8 bytes of the DCCP header. Actually, they also look
at the DCCP sequence number, which is stored beyond 8 bytes, so an
explicit pskb_may_pull() is required.
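
The bounds problem can be shown with a simplified userspace model. The helper names and the 16-byte offset are assumptions for illustration, not the DCCP header layout or the kernel's pskb_may_pull(); the point is that the check must cover the last byte actually read, not just the first 8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BASIC_LEN 8	/* what the earlier fix assumed was enough */
#define DEMO_SEQ_END   16	/* hypothetical: sequence number ends here */

/* Stand-in for a pull check: are 'need' bytes available at 'offset'? */
static bool demo_may_pull(size_t pkt_len, size_t offset, size_t need)
{
	return offset <= pkt_len && need <= pkt_len - offset;
}

/* Reads the last 4 bytes of the sequence field, or fails if out of bounds. */
bool demo_read_seq(const uint8_t *pkt, size_t pkt_len, size_t hdr_off,
		   uint32_t *seq)
{
	const uint8_t *p;

	if (!demo_may_pull(pkt_len, hdr_off, DEMO_SEQ_END))
		return false;	/* packet too short for the field we need */

	p = pkt + hdr_off + DEMO_SEQ_END - 4;
	*seq = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return true;
}
```

A packet that holds only DEMO_BASIC_LEN bytes passes an 8-byte check but is rejected here, because the read extends to byte 16.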

Fixes: 6706a97fec96 ("dccp: fix out of bound access in dccp_v4_err()")
Fixes: 1aa9d1a0e7ee ("ipv6: dccp: fix out of bound access in dccp_v6_err()")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'octeontx2-af-misc-mac-block-changes'

Hariprasad Kelam says:

====================
octeontx2-af: misc MAC block changes

This series of patches adds recent changes to the MAC (CGX/RPM) block.

Patch1: Adds new LMAC mode supported by CN10KB silicon

Patch2: In a scenario where the system boots with no cgx devices, the AF
        driver currently treats this as an error, and as a result no
        interfaces will work. This patch relaxes this check, such that
        non cgx mapped netdev devices will work.

Patch3: This patch adds required lmac validation in MAC block APIs.

Patch4: Prints an error message in case no netdev is mapped with the given
        cgx,lmac pair.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: print error message incase of invalid pf mapping

During AF driver initialization, the driver creates a mapping between each
pf and its cgx,lmac pair. Whenever there is a physical link change, the
driver uses this mapping to forward the message to the associated netdev.

This patch prints an error message in case a cgx,lmac pair is not
associated with any pf netdev.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Add validation of lmac

With the addition of new MAC blocks like CN10K RPM and CN10KB
RPM_USX, LMACs are noncontiguous. Most functions already have lmac
validation checks, but a few are missing them.
This patch adds the missing checks.
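
With noncontiguous LMACs, an index being below the LMAC count no longer proves that the LMAC exists. One way to validate it is a presence bitmap, sketched below with hypothetical names and an assumed limit of 8 LMACs per MAC block; this is not the driver's actual code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_LMAC 8		/* assumed upper bound for this sketch */

struct demo_cgx {
	unsigned long lmac_bmap;	/* bit n set = LMAC n is present */
};

bool demo_is_lmac_valid(const struct demo_cgx *cgx, int lmac_id)
{
	if (cgx == NULL || lmac_id < 0 || lmac_id >= DEMO_MAX_LMAC)
		return false;
	return ((cgx->lmac_bmap >> lmac_id) & 1UL) != 0;
}

/* Example MAC block API that validates before touching the LMAC. */
int demo_lmac_enable(const struct demo_cgx *cgx, int lmac_id)
{
	if (!demo_is_lmac_valid(cgx, lmac_id))
		return -ENODEV;
	return 0;	/* stand-in for the real register access */
}
```

With a bitmap of 0x5, LMACs 0 and 2 are accepted while LMAC 1 is rejected, even though it lies between two valid indices.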

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: Don't treat lack of CGX interfaces as error

Don't treat lack of CGX LMACs on the system as an error.
Instead ignore it so that LBK VFs are created and can be used.

Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

octeontx2-af: CN10KB: Add USGMII LMAC mode

Upon physical link change, firmware reports to the kernel about the
change along with the details like speed, lmac_type_id, etc.
Kernel derives lmac_type based on lmac_type_id received from firmware.

This patch extends current lmac list with new USGMII mode supported
by CN10KB RPM block.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: dsa: marvell: fix wrong model in compatibility list

Fix wrong switch name in compatibility list. The 88E6163 switch does not
exist; the intended model is 88E6361.

Fixes: 9229a9483d80 ("dt-bindings: net: dsa: marvell: add MV88E6361 switch to compatibility list")
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()

Since the 'cpus' field of policy structure will become empty in the
cpufreq core API, it is better to use 'related_cpus' in the exit()
callback of driver.

Fixes: c3274763bfc3 ("cpufreq: powernow-k8: Initialize per-cpu data-structures properly")
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

cpufreq: tegra194: add online/offline hooks

Implement the light-weight tear down and bring up helpers to reduce the
amount of work to do on CPU offline/online operation.
This change helps to make the hotplugging paths much faster.

Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
Link: https://lore.kernel.org/lkml/20230816033402.3abmugb5goypvllm@vireshk-i7/
[ Viresh: Fixed rebase conflict ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

igb: set max size RX buffer when store bad packet is enabled

Increase the RX buffer size to 3K when the SBP bit is on. The size of
the RX buffer determines the number of pages allocated which may not
be sufficient for receive frames larger than the set MTU size.

Cc: stable@vger.kernel.org
Fixes: 89eaefb61dc9 ("igb: Support RX-ALL feature flag.")
Reported-by: Manfred Rudigier <manfred.rudigier@omicronenergy.com>
Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netrom: Deny concurrent connect().

syzkaller reported null-ptr-deref [0] related to AF_NETROM.
This is another self-accept issue from the strace log. [1]

syz-executor creates an AF_NETROM socket and calls connect(), which
is blocked at that time.  Then, sk->sk_state is TCP_SYN_SENT and
sock->state is SS_CONNECTING.

  [pid  5059] socket(AF_NETROM, SOCK_SEQPACKET, 0) = 4
  [pid  5059] connect(4, {sa_family=AF_NETROM, sa_data="..." <unfinished ...>

Another thread calls connect() concurrently, which finally fails
with -EINVAL.  However, the problem here is the socket state is
reset even while the first connect() is blocked.

  [pid  5060] connect(4, NULL, 0 <unfinished ...>
  [pid  5060] <... connect resumed>)      = -1 EINVAL (Invalid argument)

As sk->sk_state is TCP_CLOSE and sock->state is SS_UNCONNECTED, the
following listen() succeeds.  Then, the first connect() looks up
itself as a listener and puts skb into the queue with skb->sk itself.
As a result, the next accept() gets another FD of itself as 3, and
the first connect() finishes.

  [pid  5060] listen(4, 0 <unfinished ...>
  [pid  5060] <... listen resumed>)       = 0
  [pid  5060] accept(4, NULL, NULL <unfinished ...>
  [pid  5060] <... accept resumed>)       = 3
  [pid  5059] <... connect resumed>)      = 0

Then, accept4() is called but blocked, which causes the general protection
fault later.

  [pid  5059] accept4(4, NULL, 0x20000400, SOCK_NONBLOCK <unfinished ...>

After that, another self-accept occurs by accept() and writev().

  [pid  5060] accept(4, NULL, NULL <unfinished ...>
  [pid  5061] writev(3, [{iov_base=...}] <unfinished ...>
  [pid  5061] <... writev resumed>)       = 99
  [pid  5060] <... accept resumed>)       = 6

Finally, the leader thread close()s all FDs.  Since the three FDs
reference the same socket, nr_release() does the cleanup for it
three times, and the remaining accept4() causes the following fault.

  [pid  5058] close(3)                    = 0
  [pid  5058] close(4)                    = 0
  [pid  5058] close(5)                    = -1 EBADF (Bad file descriptor)
  [pid  5058] close(6)                    = 0
  [pid  5058] <... exit_group resumed>)   = ?
  [   83.456055][ T5059] general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN

To avoid the issue, we need to return an error for connect() if
another connect() is in progress, as done in __inet_stream_connect().
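
A minimal state-machine sketch of that rule follows. The names are hypothetical and the choice of -EALREADY is an assumption made for this illustration; what matters is that a second connect() is rejected without resetting the state the first one depends on.

```c
#include <errno.h>

enum demo_sock_state {
	DEMO_SS_UNCONNECTED,
	DEMO_SS_CONNECTING,
	DEMO_SS_CONNECTED,
};

struct demo_socket {
	enum demo_sock_state state;
};

/* Returns 0 when a connection attempt was started, or a negative errno. */
int demo_connect(struct demo_socket *sock)
{
	switch (sock->state) {
	case DEMO_SS_CONNECTED:
		return -EISCONN;
	case DEMO_SS_CONNECTING:
		/* Another connect() is in flight: fail, leave state alone. */
		return -EALREADY;
	case DEMO_SS_UNCONNECTED:
		break;
	}

	sock->state = DEMO_SS_CONNECTING;
	return 0;
}
```

Because the socket stays in the connecting state after the rejected call, a following listen() on the same socket has no unconnected state to succeed from.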

[0]:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 0 PID: 5059 Comm: syz-executor.0 Not tainted 6.5.0-rc5-syzkaller-00194-gace0ab3a4b54 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5012
Code: 45 85 c9 0f 84 cc 0e 00 00 44 8b 05 11 6e 23 0b 45 85 c0 0f 84 be 0d 00 00 48 ba 00 00 00 00 00 fc ff df 4c 89 d1 48 c1 e9 03 <80> 3c 11 00 0f 85 e8 40 00 00 49 81 3a a0 69 48 90 0f 84 96 0d 00
RSP: 0018:ffffc90003d6f9e0 EFLAGS: 00010006
RAX: ffff8880244c8000 RBX: 1ffff920007adf6c RCX: 0000000000000003
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000018
RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000018 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f51d519a6c0(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f51d5158d58 CR3: 000000002943f000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
lock_acquire kernel/locking/lockdep.c:5761 [inline]
lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3a/0x50 kernel/locking/spinlock.c:162
prepare_to_wait+0x47/0x380 kernel/sched/wait.c:269
nr_accept+0x20d/0x650 net/netrom/af_netrom.c:798
do_accept+0x3a6/0x570 net/socket.c:1872
__sys_accept4_file net/socket.c:1913 [inline]
__sys_accept4+0x99/0x120 net/socket.c:1943
__do_sys_accept4 net/socket.c:1954 [inline]
__se_sys_accept4 net/socket.c:1951 [inline]
__x64_sys_accept4+0x96/0x100 net/socket.c:1951
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f51d447cae9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f51d519a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 00007f51d459bf80 RCX: 00007f51d447cae9
RDX: 0000000020000400 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f51d44c847a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000800 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007f51d459bf80 R15: 00007ffc25c34e48
</TASK>

Link: https://syzkaller.appspot.com/text?tag=CrashLog&x=152cdb63a80000
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+666c97e4686410e79649@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=666c97e4686410e79649
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: xilinx_gmii2rgmii: Convert to json schema

Convert the Xilinx GMII to RGMII Converter device tree binding
documentation to json schema.
This converter is usually used as gem <---> gmii2rgmii <---> external phy,
and its phy-handle should point to the phandle of the external phy.

Signed-off-by: Pranavi Somisetty <pranavi.somisetty@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tls-expand-tls_cipher_size_desc-to-simplify-getsockopt-setsockopt'

Sabrina Dubroca says:

====================
tls: expand tls_cipher_size_desc to simplify getsockopt/setsockopt

Commit 2d2c5ea24243 ("net/tls: Describe ciphers sizes by const
structs") introduced tls_cipher_size_desc to describe the size of the
fields of the per-cipher crypto_info structs, and commit ea7a9d88ba21
("net/tls: Use cipher sizes structs") used it, but only in
tls_device.c and tls_device_fallback.c, and skipped converting similar
code in tls_main.c and tls_sw.c.

This series expands tls_cipher_size_desc (renamed to tls_cipher_desc
to better fit this expansion) to fully describe a cipher:
- offset of the fields within the per-cipher crypto_info
- size of the full struct (for copies to/from userspace)
- offload flag
- algorithm name used by SW crypto

With these additions, we can remove ~350L of
switch (crypto_info->cipher_type) { ... }
from tls_set_device_offload, tls_sw_fallback_init,
do_tls_getsockopt_conf, do_tls_setsockopt_conf, tls_set_sw_offload
(mainly do_tls_getsockopt_conf and tls_set_sw_offload).

This series also adds the ARIA ciphers to the tls selftests, and some
more getsockopt/setsockopt tests to cover more of the code changed by
this series.
====================

Link: https://lore.kernel.org/r/cover.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: get cipher_name from cipher_desc in tls_set_sw_offload

tls_cipher_desc also contains the algorithm name needed by
crypto_alloc_aead, use it.

Finally, use get_cipher_desc to check if the cipher_type coming from
userspace is valid, and remove the cipher_type switch.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/53d021d80138aa125a9cef4468aa5ce531975a7b.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: use tls_cipher_desc to access per-cipher crypto_info in tls_set_sw_offload

The crypto_info_* helpers allow us to fetch pointers into the
per-cipher crypto_info's data.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c23af110caf0af6b68de2f86c58064913e2e902a.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: use tls_cipher_desc to get per-cipher sizes in tls_set_sw_offload

We can get rid of some local variables, but we have to keep nonce_size
because tls1.3 uses nonce_size = 0 for all ciphers.

We can also drop the runtime sanity checks on iv/rec_seq/tag size,
since we have compile time checks on those values.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/deed9c4430a62c31751a72b8c03ad66ffe710717.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: use tls_cipher_desc to simplify do_tls_getsockopt_conf

Every cipher uses the same code to update its crypto_info struct based
on the values contained in the cctx, with only the struct type and
size/offset changing. We can get those from tls_cipher_desc, and use
a single pair of memcpy and final copy_to_user.
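
The idea can be modeled outside the kernel as follows. This is a minimal
Python sketch, not the kernel code: the CipherDesc field names are
hypothetical, the AES-GCM-128 numbers follow the layout of struct
tls12_crypto_info_aes_gcm_128 (4-byte info header, then iv, key, salt,
rec_seq), and it assumes the context IV buffer stores the salt followed
by the IV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CipherDesc:
    # Hypothetical descriptor: sizes and offsets within the per-cipher struct.
    iv: int
    salt: int
    rec_seq: int
    iv_offset: int
    rec_seq_offset: int
    crypto_info: int  # size of the full per-cipher struct


# AES-GCM-128: info(4) + iv(8) + key(16) + salt(4) + rec_seq(8) = 40 bytes.
AES_GCM_128 = CipherDesc(iv=8, salt=4, rec_seq=8,
                         iv_offset=4, rec_seq_offset=32, crypto_info=40)


def fill_crypto_info(desc, stored_info, ctx_iv, ctx_rec_seq):
    """Refresh iv and rec_seq in a copy of the stored crypto_info.

    Two generic copies driven by the descriptor replace one hand-written
    branch per cipher. The returned bytes stand in for copy_to_user.
    """
    out = bytearray(stored_info[:desc.crypto_info])
    # Assumption: ctx_iv holds salt || iv, so skip the salt.
    out[desc.iv_offset:desc.iv_offset + desc.iv] = \
        ctx_iv[desc.salt:desc.salt + desc.iv]
    out[desc.rec_seq_offset:desc.rec_seq_offset + desc.rec_seq] = \
        ctx_rec_seq[:desc.rec_seq]
    return bytes(out)
```

Supporting another cipher in this model only requires another CipherDesc
instance; fill_crypto_info itself does not change.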

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c21a904b91e972bdbbf9d1c6d2731ccfa1eedf72.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: get crypto_info size from tls_cipher_desc in do_tls_setsockopt_conf

We can simplify do_tls_setsockopt_conf using tls_cipher_desc. Also use
get_cipher_desc's result to check if the cipher_type coming from
userspace is valid.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/e97658eb4c6a5832f8ba20a06c4f36a77763c59e.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: expand use of tls_cipher_desc in tls_sw_fallback_init

tls_sw_fallback_init already gets the key and tag size from
tls_cipher_desc. We can now also check that the cipher type is valid,
and stop hard-coding the algorithm name passed to crypto_alloc_aead.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c8c94b8fcafbfb558e09589c1f1ad48dbdf92f76.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: allocate the fallback aead after checking that the cipher is valid

No need to allocate the aead if we're going to fail afterwards.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/335e32511ed55a0b30f3f81a78fa8f323b3bdf8f.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: expand use of tls_cipher_desc in tls_set_device_offload

tls_set_device_offload is already getting iv and rec_seq sizes from
tls_cipher_desc. We can now also check if the cipher_type coming from
userspace is valid and can be offloaded.

We can also remove the runtime check on rec_seq, since we validate it
at compile time.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/8ab71b8eca856c7aaf981a45fe91ac649eb0e2e9.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: validate cipher descriptions at compile time

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/b38fb8cf60e099e82ae9979c3c9c92421042417c.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: extend tls_cipher_desc to fully describe the ciphers

- add nonce size, usually equal to iv_size but not for chacha
- add offsets into the crypto_info for each field
- add algorithm name
- add offloadable flag

Also add helpers to access each field of a crypto_info struct
described by a tls_cipher_desc.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/39d5f476d63c171097764e8d38f6f158b7c109ae.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: rename tls_cipher_size_desc to tls_cipher_desc

We're going to add other fields to it to fully describe a cipher, so
the "_size" name won't match the contents.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/76ca6c7686bd6d1534dfa188fb0f1f6fabebc791.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: reduce size of tls_cipher_size_desc

tls_cipher_size_desc indexes ciphers by their type, but we're not
using indices 0..50 of the array. Each struct tls_cipher_size_desc is
20B, so that's a lot of unused memory. We can reindex the array
starting at the lowest used cipher_type.

Introduce the get_cipher_size_desc helper to find the right item and
avoid out-of-bounds accesses, and make tls_cipher_size_desc's size
explicit so that gcc reminds us to update TLS_CIPHER_MIN/MAX when we
add a new cipher.
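
The reindexing can be illustrated with a small Python model (a sketch
only, not the kernel implementation). The cipher type values are the
ones from include/uapi/linux/tls.h; the table entries are placeholder
dictionaries rather than real descriptors.

```python
# Cipher type values from include/uapi/linux/tls.h.
TLS_CIPHER_AES_GCM_128 = 51
TLS_CIPHER_AES_GCM_256 = 52
TLS_CIPHER_ARIA_GCM_256 = 58

TLS_CIPHER_MIN = TLS_CIPHER_AES_GCM_128
TLS_CIPHER_MAX = TLS_CIPHER_ARIA_GCM_256

# One slot per cipher type in [MIN, MAX]; the unused slots 0..50 are gone.
cipher_size_desc = [None] * (TLS_CIPHER_MAX + 1 - TLS_CIPHER_MIN)
# Placeholder entries for illustration (only two ciphers filled in).
cipher_size_desc[TLS_CIPHER_AES_GCM_128 - TLS_CIPHER_MIN] = {"key": 16, "tag": 16}
cipher_size_desc[TLS_CIPHER_AES_GCM_256 - TLS_CIPHER_MIN] = {"key": 32, "tag": 16}


def get_cipher_size_desc(cipher_type):
    """Return the descriptor for cipher_type, or None if it is out of range."""
    if cipher_type < TLS_CIPHER_MIN or cipher_type > TLS_CIPHER_MAX:
        return None
    return cipher_size_desc[cipher_type - TLS_CIPHER_MIN]
```

The range check is what keeps a cipher_type supplied by userspace from
indexing outside the table once the array no longer starts at 0.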

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/5e054e370e240247a5d37881a1cd93a67c15f4ca.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: add TLS_CIPHER_ARIA_GCM_* to tls_cipher_size_desc

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/b2e0fb79e6d0a4478be9bf33781dc9c9281c9d56.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: move tls_cipher_size_desc to net/tls/tls.h

It's only used in net/tls/*, no need to bloat include/net/tls.h.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/dd9fad80415e5b3575b41f56b331871038362eab.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: test some invalid inputs for setsockopt

This test will need to be updated if new ciphers are added.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/bfcfa9cffda56d2064296ab7c99a05775dd4c28e.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: add getsockopt test

The kernel accepts fetching either just the version and cipher type,
or exactly the per-cipher struct. Also check that getsockopt returns
what we just passed to the kernel.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/81a007ca13de9a74f4af45635d06682cdb385a54.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: add test variants for aria-gcm

Only supported for TLS1.2.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/ccf4a4d3f3820f8ff30431b7629f5210cb33fa89.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-net-ynl-add-support-for-netlink-raw-families'

Donald Hunter says:

====================
tools/net/ynl: Add support for netlink-raw families

This patchset adds support for netlink-raw families such as rtnetlink.

Patch 1 fixes a typo in existing schemas
Patch 2 contains the schema definition
Patches 3 & 4 update the schema documentation
Patches 5 - 9 extend ynl
Patches 10 - 12 add several netlink-raw specs

The netlink-raw schema is very similar to genetlink-legacy and I thought
about making the changes there and symlinking to it. On balance I
thought that might be problematic for accurate schema validation.

rtnetlink doesn't seem to fit into unified or directional message
enumeration models. It seems like an 'explicit' model would be useful,
to force the schema author to specify the message ids directly.

There is not yet support for notifications because ynl currently doesn't
support defining 'event' properties on a 'do' operation. The message ids
are shared so ops need to be both sync and async. I plan to look at this
in a future patch.

The link and route messages contain different nested attributes
dependent on the type of link or route. Decoding these will need some
kind of attr-space selection that uses the value of another attribute as
the selector key. These nested attributes have been left with type
'binary' for now.
====================

Link: https://lore.kernel.org/r/20230825122756.7603-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Add spec for rt route messages

Add schema for rt route with support for getroute, newroute and
delroute.

Routes can be dumped with filter attributes like this:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_route.yaml \
    --dump getroute --json '{"rtm-family": 2, "rtm-table": 254}'
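
In this filter, 2 is AF_INET and 254 is RT_TABLE_MAIN. Both values are
members of the fixed struct rtmsg header that precedes the attributes in
a route message. As a rough sketch of what that header looks like on the
wire (the helper pack_rtmsg is hypothetical and is not part of ynl):

```python
import struct

# Members of struct rtmsg, in order, using the spec-style names.
RTMSG_FIELDS = (
    "rtm-family", "rtm-dst-len", "rtm-src-len", "rtm-tos",
    "rtm-table", "rtm-protocol", "rtm-scope", "rtm-type",
    "rtm-flags",
)


def pack_rtmsg(values):
    """Pack a dict of rtmsg members into the 12-byte header.

    Members that are not given default to 0. Eight u8 fields are
    followed by one u32, in host byte order.
    """
    unknown = set(values) - set(RTMSG_FIELDS)
    if unknown:
        raise KeyError(f"unknown rtmsg members: {sorted(unknown)}")
    return struct.pack("=8BI", *(values.get(name, 0) for name in RTMSG_FIELDS))


header = pack_rtmsg({"rtm-family": 2, "rtm-table": 254})
```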

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-13-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Add spec for rt link messages

Add schema for rt link with support for newlink, dellink, getlink,
setlink and getstats.

A dummy link can be created like this:

sudo ./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_link.yaml \
    --do newlink --create \
    --json '{"ifname": "dummy0", "linkinfo": {"kind": "dummy"}}'

For example, offload stats can be fetched like this:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_link.yaml \
    --dump getstats --json '{ "filter-mask": 8 }'
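
The value 8 is the filter bit for IFLA_STATS_LINK_OFFLOAD_XSTATS. The
uapi header include/uapi/linux/if_link.h derives each bit from the
attribute number with IFLA_STATS_FILTER_BIT(ATTR), defined as
1 << (ATTR - 1). A short sketch of the same arithmetic (the function
name is ours):

```python
# Attribute numbers from include/uapi/linux/if_link.h.
IFLA_STATS_LINK_64 = 1
IFLA_STATS_LINK_XSTATS = 2
IFLA_STATS_LINK_XSTATS_SLAVE = 3
IFLA_STATS_LINK_OFFLOAD_XSTATS = 4
IFLA_STATS_AF_SPEC = 5


def stats_filter_bit(attr):
    """Python equivalent of the IFLA_STATS_FILTER_BIT() macro."""
    return 1 << (attr - 1)


offload_mask = stats_filter_bit(IFLA_STATS_LINK_OFFLOAD_XSTATS)
```

Several groups can be requested at once by OR-ing their bits together.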

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-12-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Add spec for rt addr messages

Add schema for rt addr with support for:
- newaddr, deladdr, getaddr (dump)

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-11-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Add support for create flags

Add support for using NLM_F_REPLACE, _EXCL, _CREATE and _APPEND flags
in requests.
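
For reference, these flags are bits in the nlmsg_flags field and are
OR-ed with the usual request flags. A minimal sketch with the values
from include/uapi/linux/netlink.h (the request_flags helper and the
lowercase names are illustrative, not ynl's API):

```python
# Values from include/uapi/linux/netlink.h.
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
NLM_F_APPEND = 0x800

CREATE_FLAGS = {
    "replace": NLM_F_REPLACE,
    "excl": NLM_F_EXCL,
    "create": NLM_F_CREATE,
    "append": NLM_F_APPEND,
}


def request_flags(names=()):
    """Combine the base request flags with the named create flags."""
    flags = NLM_F_REQUEST | NLM_F_ACK
    for name in names:
        flags |= CREATE_FLAGS[name]
    return flags
```

For example, a "create, but fail if it already exists" request would
combine "create" and "excl".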

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-10-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Implement nlattr array-nest decoding in ynl

Add support for the 'array-nest' attribute type that is used by several
netlink-raw families.
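
An array-nest attribute contains a sequence of nested attributes, one
per array element, and each element in turn contains ordinary
attributes. The following is a standalone sketch of that decoding, not
the ynl implementation; all function names are made up for the example.
It relies only on the standard nlattr layout (u16 length including the
4-byte header, u16 type, payload padded to 4 bytes).

```python
import struct

NLA_HDRLEN = 4


def nla_align(length):
    """Round length up to the 4-byte netlink attribute alignment."""
    return (length + 3) & ~3


def parse_attrs(buf):
    """Split a buffer into a list of (type, payload) attribute tuples."""
    attrs = []
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, offset)
        if nla_len < NLA_HDRLEN or offset + nla_len > len(buf):
            raise ValueError("truncated or malformed attribute")
        attrs.append((nla_type, buf[offset + NLA_HDRLEN:offset + nla_len]))
        offset += nla_align(nla_len)
    return attrs


def decode_array_nest(payload):
    """Decode an array-nest: one list of sub-attributes per element."""
    return [parse_attrs(entry) for _index, entry in parse_attrs(payload)]


def nla(nla_type, payload):
    """Build one attribute, used here only to construct example input."""
    padding = b"\0" * (nla_align(len(payload)) - len(payload))
    return struct.pack("=HH", NLA_HDRLEN + len(payload), nla_type) + payload + padding
```

With this sketch, an input built as nla(1, nla(5, b"ab")) +
nla(2, nla(5, b"cd")) decodes to two elements, each holding a single
attribute of type 5.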

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-9-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Add support for netlink-raw families

Refactor the ynl code to encapsulate protocol specifics into
NetlinkProtocol and GenlProtocol.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20230825122756.7603-8-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/net/ynl: Fix extack parsing with fixed header genlmsg

Move decode_fixed_header into YnlFamily and add a _fixed_header_size
method to allow extack decoding to skip the fixed header.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-7-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/ynl: Add mcast-group schema parsing to ynl

Add a SpecMcastGroup class to the nlspec lib.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-6-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Document the netlink-raw schema extensions

Add a doc page for netlink-raw that describes the schema attributes
needed for netlink-raw.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-5-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Update genetlink-legacy documentation

Add documentation for recently added genetlink-legacy schema attributes.
Remove statements about 'work in progress' and 'todo'.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-4-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Add a schema for netlink-raw families

This schema is largely a copy of the genetlink-legacy schema with the
following modifications:

- change the schema id to netlink-raw
- add a top-level protonum property, e.g. 0 (for NETLINK_ROUTE)
- change the protocol enumeration to netlink-raw, removing the
genetlink options.
- replace doc references to generic netlink with raw netlink
- add a value property to mcast-group definitions

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-3-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: Fix typo in genetlink-* schemas

Fix typo verion -> version in genetlink-c and genetlink-legacy.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230825122756.7603-2-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'devlink-mlx5-add-port-function-attributes-for-ipsec'

Saeed Mahameed says:

====================
{devlink,mlx5}: Add port function attributes for ipsec

From Dima:

Introduce hypervisor-level control knobs to set the functionality of PCI
VF devices passed through to guests. The administrator of a hypervisor
host may choose to change the settings of a port function from the
defaults configured by the device firmware.

The software stack has two types of IPsec offload - crypto and packet.
Specifically, the ip xfrm command has sub-commands for "state" and
"policy" that have an "offload" parameter. With ip xfrm state, both
crypto and packet offload types are supported, while ip xfrm policy can
only be offloaded in packet mode.

The series introduces two new boolean attributes of a port function:
ipsec_crypto and ipsec_packet. The goal is to provide a similar level of
granularity for controlling VF IPsec offload capabilities, which would
be aligned with the software model. This will allow users to decide if
they want both types of offload enabled for a VF, just one of them, or
none at all (which is the default).

At a high level, the difference between the two knobs is that with
ipsec_crypto, only XFRM state can be offloaded. Specifically, only the
crypto operation (Encrypt/Decrypt) is offloaded. With ipsec_packet, both
XFRM state and policy can be offloaded. Furthermore, in addition to
crypto operation offload, IPsec encapsulation is also offloaded. For
XFRM state, choosing between crypto and packet offload types is
possible. From the HW perspective, different resources may be required
for each offload type.

Example of enabling IPsec packet offload for a VF when using switchdev
mode:

  $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
          function:
          hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet disable

  $ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable

  $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
          function:
          hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet enable

This enables the corresponding IPsec capability of the function before
it's enumerated, so when the driver reads the capability from the device
firmware, it is enabled. The driver is then able to configure
corresponding features and ops of the VF net device to support IPsec
state and policy offloading.

v2: https://lore.kernel.org/netdev/20230421104901.897946-1-dchumak@nvidia.com/
====================

Link: https://lore.kernel.org/r/20230825062836.103744-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Implement devlink port function cmds to control ipsec_packet

Implement devlink port function commands to enable / disable IPsec
packet offloads. This is used to control the IPsec capability of the
device.

When ipsec_packet is enabled for a VF, it prevents adding IPsec packet
offloads on the PF, because the two cannot be active simultaneously due
to HW constraints. Conversely, if there are any active IPsec packet
offloads on the PF, it's not allowed to enable ipsec_packet on a VF,
until PF IPsec offloads are cleared.

Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-9-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Implement devlink port function cmds to control ipsec_crypto

Implement devlink port function commands to enable / disable IPsec
crypto offloads. This is used to control the IPsec capability of the
device.

When ipsec_crypto is enabled for a VF, it prevents adding IPsec crypto
offloads on the PF, because the two cannot be active simultaneously due
to HW constraints. Conversely, if there are any active IPsec crypto
offloads on the PF, it's not allowed to enable ipsec_crypto on a VF,
until PF IPsec offloads are cleared.

Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-8-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Provide an interface to block change of IPsec capabilities

mlx5 HW can't perform IPsec offload operations on the PF and on VFs at
the same time. While the previous patches added devlink knobs to change
IPsec capabilities dynamically, logic is also needed to block such
capability changes when IPsec is already configured.
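
The constraint can be pictured with a toy model. This is an illustrative
Python sketch of the mutual exclusion only; the class and method names
are invented and do not correspond to the mlx5 driver's interface.

```python
class IpsecCapabilityGuard:
    """Toy model: IPsec offload is active on the PF or enabled on VFs, never both."""

    def __init__(self):
        self.pf_offloads = 0      # number of active PF IPsec offloads
        self.vfs_enabled = set()  # VFs with the IPsec capability enabled

    def add_pf_offload(self):
        """Refuse a new PF offload while any VF has IPsec enabled."""
        if self.vfs_enabled:
            return False
        self.pf_offloads += 1
        return True

    def del_pf_offload(self):
        if self.pf_offloads > 0:
            self.pf_offloads -= 1

    def set_vf_ipsec(self, vf, enable):
        """Refuse enabling IPsec on a VF while the PF has active offloads."""
        if not enable:
            self.vfs_enabled.discard(vf)
            return True
        if self.pf_offloads:
            return False
        self.vfs_enabled.add(vf)
        return True
```

In this model, enabling the capability on a VF fails until every PF
offload has been removed, and the reverse holds once a VF is enabled.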

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-7-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add IFC bits to support IPsec enable/disable

Add hardware definitions that allow controlling IPsec capabilities.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-6-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Rewrite IPsec vs. TC block interface

In commit 366e46242b8e ("net/mlx5e: Make IPsec offload work together
with eswitch and TC"), a new API to block IPsec vs. TC creation was introduced.

Internally, that API used the devlink lock to avoid races with userspace, but it
is not really needed as dev->priv.eswitch is stable and can't be changed. So
remove the dependency on the devlink lock and move the block encap code back to
its original place.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Drop extra layer of locks in IPsec

There is no need to hold the devlink lock, as it adds nothing over the
write mode_lock that is already taken.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-4-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>