platform/kernel/linux-rpi.git
2 years agoxfrm: pass extack down to xfrm_type ->init_state
Sabrina Dubroca [Tue, 27 Sep 2022 15:45:29 +0000 (17:45 +0200)]
xfrm: pass extack down to xfrm_type ->init_state

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoMerge branch 'xfrm: add netlink extack for state creation'
Steffen Klassert [Sat, 24 Sep 2022 07:49:47 +0000 (09:49 +0200)]
Merge branch 'xfrm: add netlink extack for state creation'

Sabrina Dubroca says:

============
This is the second part of my work adding extended acks to XFRM, now
taking care of state creation. Policies were handled in the previous
series [1].
To keep this series at a reasonable size, x->type->init_state will be
handled separately.

[1] https://lkml.kernel.org/r/cover.1661162395.git.sd@queasysnail.net
============

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack support to xfrm_init_replay
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:06 +0000 (19:04 +0200)]
xfrm: add extack support to xfrm_init_replay

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to __xfrm_init_state
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:05 +0000 (19:04 +0200)]
xfrm: add extack to __xfrm_init_state

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to attach_*
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:04 +0000 (19:04 +0200)]
xfrm: add extack to attach_*

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack support to xfrm_dev_state_add
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:03 +0000 (19:04 +0200)]
xfrm: add extack support to xfrm_dev_state_add

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to verify_one_alg, verify_auth_trunc, verify_aead
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:02 +0000 (19:04 +0200)]
xfrm: add extack to verify_one_alg, verify_auth_trunc, verify_aead

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to verify_replay
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:01 +0000 (19:04 +0200)]
xfrm: add extack to verify_replay

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack support to verify_newsa_info
Sabrina Dubroca [Wed, 14 Sep 2022 17:04:00 +0000 (19:04 +0200)]
xfrm: add extack support to verify_newsa_info

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoMerge remote-tracking branch 'xfrm: start adding netlink extack support'
Steffen Klassert [Thu, 1 Sep 2022 06:14:09 +0000 (08:14 +0200)]
Merge remote-tracking branch 'xfrm: start adding netlink extack support'

Sabrina Dubroca says:

============
XFRM states and policies are complex objects, and there are many
reasons why the kernel can reject userspace's request to create
one. This series makes it a bit clearer by providing extended ack
messages for policy creation.

A few other operations that reuse the same helper functions are also
getting partial extack support in this series. More patches will
follow to complete extack support, in particular for state creation.

Note: The policy->share attribute seems to be entirely ignored in the
kernel outside of checking its value in verify_newpolicy_info(). There
are some (very) old comments in copy_from_user_policy and
copy_to_user_policy suggesting that it should at least be copied
to/from userspace. I don't know what it was intended for.
============

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to verify_sec_ctx_len
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:12 +0000 (16:23 +0200)]
xfrm: add extack to verify_sec_ctx_len

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to validate_tmpl
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:11 +0000 (16:23 +0200)]
xfrm: add extack to validate_tmpl

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to verify_policy_type
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:10 +0000 (16:23 +0200)]
xfrm: add extack to verify_policy_type

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack to verify_policy_dir
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:09 +0000 (16:23 +0200)]
xfrm: add extack to verify_policy_dir

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: add extack support to verify_newpolicy_info
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:08 +0000 (16:23 +0200)]
xfrm: add extack support to verify_newpolicy_info

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: propagate extack to all netlink doit handlers
Sabrina Dubroca [Tue, 30 Aug 2022 14:23:07 +0000 (16:23 +0200)]
xfrm: propagate extack to all netlink doit handlers

xfrm_user_rcv_msg() already handles extack, we just need to pass it down.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md mode
Eyal Birger [Fri, 26 Aug 2022 11:47:00 +0000 (14:47 +0300)]
xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md mode

Allow specifying the xfrm interface if_id and link as part of a route
metadata using the lwtunnel infrastructure.

This allows for example using a single xfrm interface in collect_md
mode as the target of multiple routes each specifying a different if_id.

With the appropriate changes to iproute2, considering an xfrm device
ipsec1 in collect_md mode one can for example add a route specifying
an if_id like so:

ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1

In which case traffic routed to the device via this route would use
if_id in the xfrm interface policy lookup.

Or in the context of vrf, one can also specify the "link" property:

ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 link_dev eth15

Note: LWT_XFRM_LINK uses NLA_U32 similar to IFLA_XFRM_LINK even though
internally "link" is signed. This is consistent with other _LINK
attributes in other devices as well as in bpf and should not have an
effect as device indexes can't be negative.

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: interface: support collect metadata mode
Eyal Birger [Fri, 26 Aug 2022 11:46:59 +0000 (14:46 +0300)]
xfrm: interface: support collect metadata mode

This commit adds support for 'collect_md' mode on xfrm interfaces.

Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.

On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. If exists, the link property is also fetched from
the metadata. The dst metadata type used is METADATA_XFRM which holds
these properties.

On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state, and the link is fetched from the incoming
device. This information can later be used by upper layers such as tc,
ebpf, and ip rules.

Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.

Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.

Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agonet: allow storing xfrm interface metadata in metadata_dst
Eyal Birger [Fri, 26 Aug 2022 11:46:58 +0000 (14:46 +0300)]
net: allow storing xfrm interface metadata in metadata_dst

XFRM interfaces provide the association of various XFRM transformations
to a netdevice using an 'if_id' identifier common to both the XFRM data
structures (polcies, states) and the interface. The if_id is configured by
the controlling entity (usually the IKE daemon) and can be used by the
administrator to define logical relations between different connections.

For example, different connections can share the if_id identifier so
that they pass through the same interface, . However, currently it is
not possible for connections using a different if_id to use the same
interface while retaining the logical separation between them, without
using additional criteria such as skb marks or different traffic
selectors.

When having a large number of connections, it is useful to have a the
logical separation offered by the if_id identifier but use a single
network interface. Similar to the way collect_md mode is used in IP
tunnels.

This patch attempts to enable different configuration mechanisms - such
as ebpf programs, LWT encapsulations, and TC - to attach metadata
to skbs which would carry the if_id. This way a single xfrm interface in
collect_md mode can demux traffic based on this configuration on tx and
provide this metadata on rx.

The XFRM metadata is somewhat similar to ip tunnel metadata in that it
has an "id", and shares similar configuration entities (bpf, tc, ...),
however, it does not necessarily represent an IP tunnel or use other
ip tunnel information, and also has an optional "link" property which
can be used for affecting underlying routing decisions.

Additional xfrm related criteria may also be added in the future.

Therefore, a new metadata type is introduced, to be used in subsequent
patches in the xfrm interface and configuration entities.

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoxfrm: Drop unused argument
Hongbin Wang [Mon, 22 Aug 2022 03:41:47 +0000 (23:41 -0400)]
xfrm: Drop unused argument

Drop unused argument from xfrm_policy_match,
__xfrm_policy_eval_candidates and xfrm_policy_eval_candidates.
No functional changes intended.

Signed-off-by: Hongbin Wang <wh_bin@126.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoselftests/net: Refactor xfrm_fill_key() to use array of structs
Gautam Menghani [Sat, 6 Aug 2022 16:35:30 +0000 (22:05 +0530)]
selftests/net: Refactor xfrm_fill_key() to use array of structs

A TODO in net/ipsec.c asks to refactor the code in xfrm_fill_key() to
use set/map to avoid manually comparing each algorithm with the "name"
parameter passed to the function as an argument. This patch refactors
the code to create an array of structs where each struct contains the
algorithm name and its corresponding key length.

Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2 years agoMerge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 11 Aug 2022 20:45:37 +0000 (13:45 -0700)]
Merge tag 'net-6.0-rc1' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, bpf, can and netfilter.

  A little larger than usual but it's all fixes, no late features. It's
  large partially because of timing, and partially because of follow ups
  to stuff that got merged a week or so before the merge window and
  wasn't as widely tested. Maybe the Bluetooth fixes are a little
  alarming so we'll address that, but the rest seems okay and not scary.

  Notably we're including a fix for the netfilter Kconfig [1], your WiFi
  warning [2] and a bluetooth fix which should unblock syzbot [3].

  Current release - regressions:

   - Bluetooth:
      - don't try to cancel uninitialized works [3]
      - L2CAP: fix use-after-free caused by l2cap_chan_put

   - tls: rx: fix device offload after recent rework

   - devlink: fix UAF on failed reload and leftover locks in mlxsw

  Current release - new code bugs:

   - netfilter:
      - flowtable: fix incorrect Kconfig dependencies [1]
      - nf_tables: fix crash when nf_trace is enabled

   - bpf:
      - use proper target btf when exporting attach_btf_obj_id
      - arm64: fixes for bpf trampoline support

   - Bluetooth:
      - ISO: unlock on error path in iso_sock_setsockopt()
      - ISO: fix info leak in iso_sock_getsockopt()
      - ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
      - ISO: fix memory corruption on iso_pinfo.base
      - ISO: fix not using the correct QoS
      - hci_conn: fix updating ISO QoS PHY

   - phy: dp83867: fix get nvmem cell fail

  Previous releases - regressions:

   - wifi: cfg80211: fix validating BSS pointers in
     __cfg80211_connect_result [2]

   - atm: bring back zatm uAPI after ATM had been removed

   - properly fix old bug making bonding ARP monitor mode not being able
     to work with software devices with lockless Tx

   - tap: fix null-deref on skb->dev in dev_parse_header_protocol

   - revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some
     devices and breaks others

   - netfilter:
      - nf_tables: many fixes rejecting cross-object linking which may
        lead to UAFs
      - nf_tables: fix null deref due to zeroed list head
      - nf_tables: validate variable length element extension

   - bgmac: fix a BUG triggered by wrong bytes_compl

   - bcmgenet: indicate MAC is in charge of PHY PM

  Previous releases - always broken:

   - bpf:
      - fix bad pointer deref in bpf_sys_bpf() injected via test infra
      - disallow non-builtin bpf programs calling the prog_run command
      - don't reinit map value in prealloc_lru_pop
      - fix UAFs during the read of map iterator fd
      - fix invalidity check for values in sk local storage map
      - reject sleepable program for non-resched map iterator

   - mptcp:
      - move subflow cleanup in mptcp_destroy_common()
      - do not queue data on closed subflows

   - virtio_net: fix memory leak inside XDP_TX with mergeable

   - vsock: fix memory leak when multiple threads try to connect()

   - rework sk_user_data sharing to prevent psock leaks

   - geneve: fix TOS inheriting for ipv4

   - tunnels & drivers: do not use RT_TOS for IPv6 flowlabel

   - phy: c45 baset1: do not skip aneg configuration if clock role is
     not specified

   - rose: avoid overflow when /proc displays timer information

   - x25: fix call timeouts in blocking connects

   - can: mcp251x: fix race condition on receive interrupt

   - can: j1939:
      - replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
      - fix memory leak of skbs in j1939_session_destroy()

  Misc:

   - docs: bpf: clarify that many things are not uAPI

   - seg6: initialize induction variable to first valid array index (to
     silence clang vs objtool warning)

   - can: ems_usb: fix clang 14's -Wunaligned-access warning"

* tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits)
  net: atm: bring back zatm uAPI
  dpaa2-eth: trace the allocated address instead of page struct
  net: add missing kdoc for struct genl_multicast_group::flags
  nfp: fix use-after-free in area_cache_get()
  MAINTAINERS: use my korg address for mt7601u
  mlxsw: minimal: Fix deadlock in ports creation
  bonding: fix reference count leak in balance-alb mode
  net: usb: qmi_wwan: Add support for Cinterion MV32
  bpf: Shut up kern_sys_bpf warning.
  net/tls: Use RCU API to access tls_ctx->netdev
  tls: rx: device: don't try to copy too much on detach
  tls: rx: device: bound the frag walk
  net_sched: cls_route: remove from list when handle is 0
  selftests: forwarding: Fix failing tests with old libnet
  net: refactor bpf_sk_reuseport_detach()
  net: fix refcount bug in sk_psock_get (2)
  selftests/bpf: Ensure sleepable program is rejected by hash map iter
  selftests/bpf: Add write tests for sk local storage map iterator
  selftests/bpf: Add tests for reading a dangling map iter fd
  bpf: Only allow sleepable program for resched-able iterator
  ...

2 years agoMerge tag 'acpi-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Thu, 11 Aug 2022 20:26:09 +0000 (13:26 -0700)]
Merge tag 'acpi-5.20-rc1-2' of git://git./linux/kernel/git/rafael/linux-pm

Pull more ACPI updates from Rafael Wysocki:
 "These fix up direct references to the fwnode field in struct device
  and extend ACPI device properties support.

  Specifics:

   - Replace direct references to the fwnode field in struct device with
     dev_fwnode() and device_match_fwnode() (Andy Shevchenko)

   - Make the ACPI code handling device properties support properties
     with buffer values (Sakari Ailus)"

* tag 'acpi-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: property: Fix error handling in acpi_init_properties()
  ACPI: VIOT: Do not dereference fwnode in struct device
  ACPI: property: Read buffer properties as integers
  ACPI: property: Add support for parsing buffer property UUID
  ACPI: property: Unify integer value reading functions
  ACPI: property: Switch node property referencing from ifs to a switch
  ACPI: property: Move property ref argument parsing into a new function
  ACPI: property: Use acpi_object_type consistently in property ref parsing
  ACPI: property: Tie data nodes to acpi handles
  ACPI: property: Return type of acpi_add_nondev_subnodes() should be bool

2 years agoMerge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Thu, 11 Aug 2022 20:11:49 +0000 (13:11 -0700)]
Merge tag 'iomap-6.0-merge-2' of git://git./fs/xfs/xfs-linux

Pull more iomap updates from Darrick Wong:
 "In the past 10 days or so I've not heard any ZOMG STOP style
  complaints about removing ->writepage support from gfs2 or zonefs, so
  here's the pull request removing them (and the underlying fs iomap
  support) from the kernel:

   - Remove iomap_writepage and all callers, since the mm apparently
     never called the zonefs or gfs2 writepage functions"

* tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: remove iomap_writepage
  zonefs: remove ->writepage
  gfs2: remove ->writepage
  gfs2: stop using generic_writepages in gfs2_ail1_start_one

2 years agoMerge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client
Linus Torvalds [Thu, 11 Aug 2022 19:41:07 +0000 (12:41 -0700)]
Merge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a good pile of various fixes and cleanups from Xiubo, Jeff,
  Luis and others, almost exclusively in the filesystem.

  Several patches touch files outside of our normal purview to set the
  stage for bringing in Jeff's long awaited ceph+fscrypt series in the
  near future. All of them have appropriate acks and sat in linux-next
  for a while"

* tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client: (27 commits)
  libceph: clean up ceph_osdc_start_request prototype
  libceph: fix ceph_pagelist_reserve() comment typo
  ceph: remove useless check for the folio
  ceph: don't truncate file in atomic_open
  ceph: make f_bsize always equal to f_frsize
  ceph: flush the dirty caps immediatelly when quota is approaching
  libceph: print fsid and epoch with osd id
  libceph: check pointer before assigned to "c->rules[]"
  ceph: don't get the inline data for new creating files
  ceph: update the auth cap when the async create req is forwarded
  ceph: make change_auth_cap_ses a global symbol
  ceph: fix incorrect old_size length in ceph_mds_request_args
  ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
  ceph: call netfs_subreq_terminated with was_async == false
  ceph: convert to generic_file_llseek
  ceph: fix the incorrect comment for the ceph_mds_caps struct
  ceph: don't leak snap_rwsem in handle_cap_grant
  ceph: prevent a client from exceeding the MDS maximum xattr size
  ceph: choose auth MDS for getxattr with the Xs caps
  ceph: add session already open notify support
  ...

2 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 11 Aug 2022 19:10:08 +0000 (12:10 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:

 - Xen timer fixes

 - Documentation formatting fixes

 - Make rseq selftest compatible with glibc-2.35

 - Fix handling of illegal LEA reg, reg

 - Cleanup creation of debugfs entries

 - Fix steal time cache handling bug

 - Fixes for MMIO caching

 - Optimize computation of number of LBRs

 - Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table
  Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline
  KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh
  KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers
  KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES
  KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals
  KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test
  KVM: selftests: Make rseq compatible with glibc-2.35
  KVM: Actually create debugfs in kvm_create_vm()
  KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()
  KVM: Get an fd before creating the VM
  KVM: Shove vcpu stats_id init into kvm_vcpu_init()
  KVM: Shove vm stats_id init into kvm_create_vm()
  KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen
  KVM: x86/mmu: rename trace function name for asynchronous page fault
  KVM: x86/xen: Stop Xen timer before changing IRQ
  KVM: x86/xen: Initialize Xen timer only once
  KVM: SVM: Disable SEV-ES support if MMIO caching is disable
  KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change
  KVM: x86: Tag kvm_mmu_x86_module_init() with __init
  ...

2 years agonet: atm: bring back zatm uAPI
Jakub Kicinski [Wed, 10 Aug 2022 16:45:47 +0000 (09:45 -0700)]
net: atm: bring back zatm uAPI

Jiri reports that linux-atm does not build without this header.
Bring it back. It's completely dead code but we can't break
the build for user space :(

Reported-by: Jiri Slaby <jirislaby@kernel.org>
Fixes: 052e1f01bfae ("net: atm: remove support for ZeitNet ZN122x ATM devices")
Link: https://lore.kernel.org/all/8576aef3-37e4-8bae-bab5-08f82a78efd3@kernel.org/
Link: https://lore.kernel.org/r/20220810164547.484378-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agodpaa2-eth: trace the allocated address instead of page struct
Chen Lin [Thu, 11 Aug 2022 15:16:51 +0000 (23:16 +0800)]
dpaa2-eth: trace the allocated address instead of page struct

We should trace the allocated address instead of page struct.

Fixes: 27c874867c4e ("dpaa2-eth: Use a single page per Rx buffer")
Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20220811151651.3327-1-chen45464546@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge branch 'acpi-properties'
Rafael J. Wysocki [Thu, 11 Aug 2022 17:21:03 +0000 (19:21 +0200)]
Merge branch 'acpi-properties'

Merge changes adding support for device properties with buffer values
to the ACPI device properties handling code.

* acpi-properties:
  ACPI: property: Fix error handling in acpi_init_properties()
  ACPI: property: Read buffer properties as integers
  ACPI: property: Add support for parsing buffer property UUID
  ACPI: property: Unify integer value reading functions
  ACPI: property: Switch node property referencing from ifs to a switch
  ACPI: property: Move property ref argument parsing into a new function
  ACPI: property: Use acpi_object_type consistently in property ref parsing
  ACPI: property: Tie data nodes to acpi handles
  ACPI: property: Return type of acpi_add_nondev_subnodes() should be bool

2 years agonet: add missing kdoc for struct genl_multicast_group::flags
Jakub Kicinski [Tue, 9 Aug 2022 23:20:12 +0000 (16:20 -0700)]
net: add missing kdoc for struct genl_multicast_group::flags

Multicast group flags were added in commit 4d54cc32112d ("mptcp: avoid
lock_fast usage in accept path"), but it missed adding the kdoc.

Mention which flags go into that field, and do the same for
op structs.

Link: https://lore.kernel.org/r/20220809232012.403730-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge tag 'input-for-v5.20-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 11 Aug 2022 16:23:08 +0000 (09:23 -0700)]
Merge tag 'input-for-v5.20-rc0' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

 - changes to input core to properly queue synthetic events (such as
   autorepeat) and to release multitouch contacts when an input device
   is inhibited or suspended

 - reworked quirk handling in i8042 driver that consolidates multiple
   DMI tables into one and adds several quirks for TUXEDO line of
   laptops

 - update to mt6779 keypad to better reflect organization of the
   hardware

 - changes to mtk-pmic-keys driver preparing it to handle more variants

 - facelift of adp5588-keys driver

 - improvements to iqs7222 driver

 - adjustments to various DT binding documents for input devices

 - other assorted driver fixes.

* tag 'input-for-v5.20-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
  Input: adc-joystick - fix ordering in adc_joystick_probe()
  dt-bindings: input: ariel-pwrbutton: use spi-peripheral-props.yaml
  Input: deactivate MT slots when inhibiting or suspending devices
  Input: properly queue synthetic events
  dt-bindings: input: iqs7222: Use central 'linux,code' definition
  Input: i8042 - add dritek quirk for Acer Aspire One AO532
  dt-bindings: input: gpio-keys: accept also interrupt-extended
  dt-bindings: input: gpio-keys: reference input.yaml and document properties
  dt-bindings: input: gpio-keys: enforce node names to match all properties
  dt-bindings: input: Convert adc-keys to DT schema
  dt-bindings: input: Centralize 'linux,input-type' definition
  dt-bindings: input: Use common 'linux,keycodes' definition
  dt-bindings: input: Centralize 'linux,code' definition
  dt-bindings: input: Increase maximum keycode value to 0x2ff
  Input: mt6779-keypad - implement row/column selection
  Input: mt6779-keypad - match hardware matrix organization
  Input: i8042 - add additional TUXEDO devices to i8042 quirk tables
  Input: goodix - switch use of acpi_gpio_get_*_resource() APIs
  Input: i8042 - add TUXEDO devices to i8042 quirk tables
  Input: i8042 - add debug output for quirks
  ...

2 years agonfp: fix use-after-free in area_cache_get()
Jialiang Wang [Wed, 10 Aug 2022 07:30:57 +0000 (15:30 +0800)]
nfp: fix use-after-free in area_cache_get()

area_cache_get() is used to distribute cache->area and set cache->id,
 and if cache->id is not 0 and cache->area->kref refcount is 0, it will
 release the cache->area by nfp_cpp_area_release(). area_cache_get()
 set cache->id before cpp->op->area_init() and nfp_cpp_area_acquire().

But if area_init() or nfp_cpp_area_acquire() fails, the cache->id is
 is already set but the refcount is not increased as expected. At this
 time, calling the nfp_cpp_area_release() will cause use-after-free.

To avoid the use-after-free, set cache->id after area_init() and
 nfp_cpp_area_acquire() complete successfully.

Note: This vulnerability is triggerable by providing emulated device
 equipped with specified configuration.

 BUG: KASAN: use-after-free in nfp6000_area_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:760)
  Write of size 4 at addr ffff888005b7f4a0 by task swapper/0/1

 Call Trace:
  <TASK>
 nfp6000_area_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:760)
 area_cache_get.constprop.8 (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:884)

 Allocated by task 1:
 nfp_cpp_area_alloc_with_name (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:303)
 nfp_cpp_area_cache_add (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:802)
 nfp6000_init (drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c:1230)
 nfp_cpp_from_operations (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:1215)
 nfp_pci_probe (drivers/net/ethernet/netronome/nfp/nfp_main.c:744)

 Freed by task 1:
 kfree (mm/slub.c:4562)
 area_cache_get.constprop.8 (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:873)
 nfp_cpp_read (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:924 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c:973)
 nfp_cpp_readl (drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpplib.c:48)

Signed-off-by: Jialiang Wang <wangjialiang0806@163.com>
Reviewed-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220810073057.4032-1-wangjialiang0806@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMAINTAINERS: use my korg address for mt7601u
Jakub Kicinski [Tue, 9 Aug 2022 23:38:43 +0000 (16:38 -0700)]
MAINTAINERS: use my korg address for mt7601u

Change my address for mt7601u to the main one.

Link: https://lore.kernel.org/r/20220809233843.408004-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agomlxsw: minimal: Fix deadlock in ports creation
Vadim Pasternak [Thu, 11 Aug 2022 09:57:36 +0000 (11:57 +0200)]
mlxsw: minimal: Fix deadlock in ports creation

Drop devl_lock() / devl_unlock() from ports creation and removal flows
since the devlink instance lock is now taken by mlxsw_core.

Fixes: 72a4c8c94efa ("mlxsw: convert driver to use unlocked devlink API during init/fini")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/f4afce5ab0318617f3866b85274be52542d59b32.1660211614.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agobonding: fix reference count leak in balance-alb mode
Jay Vosburgh [Thu, 11 Aug 2022 05:06:53 +0000 (22:06 -0700)]
bonding: fix reference count leak in balance-alb mode

Commit d5410ac7b0ba ("net:bonding:support balance-alb interface
with vlan to bridge") introduced a reference count leak by not releasing
the reference acquired by ip_dev_find().  Remedy this by insuring the
reference is released.

Fixes: d5410ac7b0ba ("net:bonding:support balance-alb interface with vlan to bridge")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/26758.1660194413@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoRevert "Makefile.extrawarn: re-enable -Wformat for clang"
Linus Torvalds [Thu, 11 Aug 2022 15:40:01 +0000 (08:40 -0700)]
Revert "Makefile.extrawarn: re-enable -Wformat for clang"

This reverts commit 258fafcd0683d9ccfa524129d489948ab3ddc24c.

The clang -Wformat warning is terminally broken, and the clang people
can't seem to get their act together.

This test program causes a warning with clang:

#include <stdio.h>

int main(int argc, char **argv)
{
printf("%hhu\n", 'a');
}

resulting in

  t.c:5:19: warning: format specifies type 'unsigned char' but the argument has type 'int' [-Wformat]
          printf("%hhu\n", 'a');
                  ~~~~     ^~~
                  %d

and apparently clang people consider that a feature, because they don't
want to face the reality of how either C character constants, C
arithmetic, and C varargs functions work.

The rest of the world just shakes their head at that kind of
incompetence, and turns off -Wformat for clang again.

And no, the "you should use a pointless cast to shut this up" is not a
valid answer.  That warning should not exist in the first place, or at
least be optinal with some "-Wformat-me-harder" kind of option.

[ Admittedly, there's also very little reason to *ever* use '%hh[ud]' in
  C, but what little reason there is is entirely about 'I want to see
  only the low 8 bits of the argument'. So I would suggest nobody ever
  use that format in the first place, but if they do, the clang
  behavious is simply always wrong. Because '%hhu' takes an 'int'. It's
  that simple. ]

Reported-by: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agonet: usb: qmi_wwan: Add support for Cinterion MV32
Slark Xiao [Wed, 10 Aug 2022 01:45:21 +0000 (09:45 +0800)]
net: usb: qmi_wwan: Add support for Cinterion MV32

There are 2 models for MV32 serials. MV32-W-A is designed
based on Qualcomm SDX62 chip, and MV32-W-B is designed based
on Qualcomm SDX65 chip. So we use 2 different PID to separate it.

Test evidence as below:
T:  Bus=03 Lev=01 Prnt=01 Port=02 Cnt=03 Dev#=  3 Spd=480 MxCh= 0
D:  Ver= 2.10 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=1e2d ProdID=00f3 Rev=05.04
S:  Manufacturer=Cinterion
S:  Product=Cinterion PID 0x00F3 USB Mobile Broadband
S:  SerialNumber=d7b4be8d
C:  #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA
I:  If#=0x0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
I:  If#=0x1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
I:  If#=0x2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
I:  If#=0x3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option

T:  Bus=03 Lev=01 Prnt=01 Port=02 Cnt=03 Dev#= 10 Spd=480 MxCh= 0
D:  Ver= 2.10 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=1e2d ProdID=00f4 Rev=05.04
S:  Manufacturer=Cinterion
S:  Product=Cinterion PID 0x00F4 USB Mobile Broadband
S:  SerialNumber=d095087d
C:  #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA
I:  If#=0x0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
I:  If#=0x1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
I:  If#=0x2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
I:  If#=0x3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option

Signed-off-by: Slark Xiao <slark_xiao@163.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Link: https://lore.kernel.org/r/20220810014521.9383-1-slark_xiao@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Jakub Kicinski [Thu, 11 Aug 2022 14:50:28 +0000 (07:50 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Pull in bpf to silence a false positive warning.

* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Shut up kern_sys_bpf warning.

Link: https://lore.kernel.org/r/CAADnVQK589CZN1Q9w8huJqkEyEed+ZMTWqcpA1Rm2CjN3a4XoQ@mail.gmail.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agobpf: Shut up kern_sys_bpf warning.
Alexei Starovoitov [Thu, 11 Aug 2022 06:52:28 +0000 (23:52 -0700)]
bpf: Shut up kern_sys_bpf warning.

Shut up this warning:
kernel/bpf/syscall.c:5089:5: warning: no previous prototype for function 'kern_sys_bpf' [-Wmissing-prototypes]
int kern_sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agoKVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table
Bagas Sanjaya [Mon, 27 Jun 2022 09:51:51 +0000 (16:51 +0700)]
KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table

There is unexpected warning on KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability
table, which cause the table to be rendered as paragraph text instead.

The warning is due to missing colon at capability name and returns keyword,
as well as improper alignment on multi-line returns field.

Fix the warning by adding missing colons and aligning the field.

Link: https://lore.kernel.org/lkml/20220627181937.3be67263@canb.auug.org.au/
Fixes: 084cc29f8bbb03 ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-next@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Message-Id: <20220627095151.19339-3-bagasdotme@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoDocumentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline
Bagas Sanjaya [Mon, 27 Jun 2022 09:51:50 +0000 (16:51 +0700)]
Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline

Extend heading underline for KVM_CAP_VM_DISABLE_NX_HUGE_PAGE to match
the heading text length.

Link: https://lore.kernel.org/lkml/20220627181937.3be67263@canb.auug.org.au/
Fixes: 084cc29f8bbb03 ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-next@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Message-Id: <20220627095151.19339-2-bagasdotme@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agonet/tls: Use RCU API to access tls_ctx->netdev
Maxim Mikityanskiy [Wed, 10 Aug 2022 08:16:02 +0000 (11:16 +0300)]
net/tls: Use RCU API to access tls_ctx->netdev

Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.

Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.

Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.

The transition to RCU exposes existing issues, fixed by this commit:

1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.

2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.

Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agotls: rx: device: don't try to copy too much on detach
Jakub Kicinski [Tue, 9 Aug 2022 17:55:44 +0000 (10:55 -0700)]
tls: rx: device: don't try to copy too much on detach

Another device offload bug, we use the length of the output
skb as an indication of how much data to copy. But that skb
is sized to offset + record length, and we start from offset.
So we end up double-counting the offset which leads to
skb_copy_bits() returning -EFAULT.

Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agotls: rx: device: bound the frag walk
Jakub Kicinski [Tue, 9 Aug 2022 17:55:43 +0000 (10:55 -0700)]
tls: rx: device: bound the frag walk

We can't do skb_walk_frags() on the input skbs, because
the input skbs is really just a pointer to the tcp read
queue. We need to bound the "is decrypted" check by the
amount of data in the message.

Note that the walk in tls_device_reencrypt() is after a
CoW so the skb there is safe to walk. Actually in the
current implementation it can't have frags at all, but
whatever, maybe one day it will.

Reported-by: Tariq Toukan <tariqt@nvidia.com>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Tested-by: Ran Rozenstein <ranro@nvidia.com>
Link: https://lore.kernel.org/r/20220809175544.354343-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agonet_sched: cls_route: remove from list when handle is 0
Thadeu Lima de Souza Cascardo [Tue, 9 Aug 2022 17:05:18 +0000 (14:05 -0300)]
net_sched: cls_route: remove from list when handle is 0

When a route filter is replaced and the old filter has a 0 handle, the old
one won't be removed from the hashtable, while it will still be freed.

The test was there since before commit 1109c00547fc ("net: sched: RCU
cls_route"), when a new filter was not allocated when there was an old one.
The old filter was reused and the reinserting would only be necessary if an
old filter was replaced. That was still wrong for the same case where the
old handle was 0.

Remove the old filter from the list independently from its handle value.

This fixes CVE-2022-2588, also reported as ZDI-CAN-17440.

Reported-by: Zhenpeng Lin <zplin@u.northwestern.edu>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Kamal Mostafa <kamal@canonical.com>
Cc: <stable@vger.kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220809170518.164662-1-cascardo@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoselftests: forwarding: Fix failing tests with old libnet
Ido Schimmel [Tue, 9 Aug 2022 11:33:20 +0000 (14:33 +0300)]
selftests: forwarding: Fix failing tests with old libnet

The custom multipath hash tests use mausezahn in order to test how
changes in various packet fields affect the packet distribution across
the available nexthops.

The tool uses the libnet library for various low-level packet
construction and injection. The library started using the
"SO_BINDTODEVICE" socket option for IPv6 sockets in version 1.1.6 and
for IPv4 sockets in version 1.2.

When the option is not set, packets are not routed according to the
table associated with the VRF master device and tests fail.

Fix this by prefixing the command with "ip vrf exec", which will cause
the route lookup to occur in the VRF routing table. This makes the tests
pass regardless of the libnet library version.

Fixes: 511e8db54036 ("selftests: forwarding: Add test for custom multipath hash")
Fixes: 185b0c190bb6 ("selftests: forwarding: Add test for custom multipath hash with IPv4 GRE")
Fixes: b7715acba4d3 ("selftests: forwarding: Add test for custom multipath hash with IPv6 GRE")
Reported-by: Ivan Vecera <ivecera@redhat.com>
Tested-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Link: https://lore.kernel.org/r/20220809113320.751413-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Jakub Kicinski [Thu, 11 Aug 2022 04:48:14 +0000 (21:48 -0700)]
Merge https://git./linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
bpf 2022-08-10

We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).

The main changes are:

1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.

2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.

3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.

4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.

5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.

6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.

7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.

8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.

9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
   offset buffer, from Aijun Sun.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  selftests/bpf: Ensure sleepable program is rejected by hash map iter
  selftests/bpf: Add write tests for sk local storage map iterator
  selftests/bpf: Add tests for reading a dangling map iter fd
  bpf: Only allow sleepable program for resched-able iterator
  bpf: Check the validity of max_rdwr_access for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
  bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
  bpf: Acquire map uref in .init_seq_private for hash map iterator
  bpf: Acquire map uref in .init_seq_private for array map iterator
  bpf: Disallow bpf programs call prog_run command.
  bpf, arm64: Fix bpf trampoline instruction endianness
  selftests/bpf: Add test for prealloc_lru_pop bug
  bpf: Don't reinit map value in prealloc_lru_pop
  bpf: Allow calling bpf_prog_test kfuncs in tracing programs
  bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
  selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
  bpf: Use proper target btf when exporting attach_btf_obj_id
  mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
  bpf: Cleanup ftrace hash in bpf_trampoline_put
  BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
  ...
====================

Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agoMerge branch 'net-enhancements-to-sk_user_data-field'
Jakub Kicinski [Thu, 11 Aug 2022 04:48:08 +0000 (21:48 -0700)]
Merge branch 'net-enhancements-to-sk_user_data-field'

Hawkins Jiawei says:

====================
net: enhancements to sk_user_data field

This patchset fixes refcount bug by adding SK_USER_DATA_PSOCK flag bit in
sk_user_data field. The bug cause following info:

WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
 <TASK>
 __refcount_add_not_zero include/linux/refcount.h:163 [inline]
 __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
 refcount_inc_not_zero include/linux/refcount.h:245 [inline]
 sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
 sk_backlog_rcv include/net/sock.h:1061 [inline]
 __release_sock+0x134/0x3b0 net/core/sock.c:2849
 release_sock+0x54/0x1b0 net/core/sock.c:3404
 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
 __sys_shutdown_sock net/socket.c:2331 [inline]
 __sys_shutdown_sock net/socket.c:2325 [inline]
 __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
 __do_sys_shutdown net/socket.c:2351 [inline]
 __se_sys_shutdown net/socket.c:2349 [inline]
 __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
 </TASK>

To improve code maintainability, this patchset refactors sk_user_data
flags code to be more generic.
====================

Link: https://lore.kernel.org/r/cover.1659676823.git.yin31149@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agonet: refactor bpf_sk_reuseport_detach()
Hawkins Jiawei [Fri, 5 Aug 2022 07:48:36 +0000 (15:48 +0800)]
net: refactor bpf_sk_reuseport_detach()

Refactor sk_user_data dereference using more generic function
__rcu_dereference_sk_user_data_with_flags(), which improve its
maintainability

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agonet: fix refcount bug in sk_psock_get (2)
Hawkins Jiawei [Fri, 5 Aug 2022 07:48:34 +0000 (15:48 +0800)]
net: fix refcount bug in sk_psock_get (2)

Syzkaller reports refcount bug as follows:
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
 <TASK>
 __refcount_add_not_zero include/linux/refcount.h:163 [inline]
 __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
 refcount_inc_not_zero include/linux/refcount.h:245 [inline]
 sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
 sk_backlog_rcv include/net/sock.h:1061 [inline]
 __release_sock+0x134/0x3b0 net/core/sock.c:2849
 release_sock+0x54/0x1b0 net/core/sock.c:3404
 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
 __sys_shutdown_sock net/socket.c:2331 [inline]
 __sys_shutdown_sock net/socket.c:2325 [inline]
 __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
 __do_sys_shutdown net/socket.c:2351 [inline]
 __se_sys_shutdown net/socket.c:2349 [inline]
 __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
 </TASK>

During SMC fallback process in connect syscall, kernel will
replaces TCP with SMC. In order to forward wakeup
smc socket waitqueue after fallback, kernel will sets
clcsk->sk_user_data to origin smc socket in
smc_fback_replace_callbacks().

Later, in shutdown syscall, kernel will calls
sk_psock_get(), which treats the clcsk->sk_user_data
as psock type, triggering the refcnt warning.

So, the root cause is that smc and psock, both will use
sk_user_data field. So they will mismatch this field
easily.

This patch solves it by using another bit(defined as
SK_USER_DATA_PSOCK) in PTRMASK, to mark whether
sk_user_data points to a psock object or not.
This patch depends on a PTRMASK introduced in commit f1ff5ce2cd5e
("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

For there will possibly be more flags in the sk_user_data field,
this patch also refactor sk_user_data flags code to be more generic
to improve its maintainability.

Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 years agox86: link vdso and boot with -z noexecstack --no-warn-rwx-segments
Nick Desaulniers [Wed, 10 Aug 2022 22:24:41 +0000 (15:24 -0700)]
x86: link vdso and boot with -z noexecstack --no-warn-rwx-segments

Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:

  ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
  ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions

Generally, we would like to avoid the stack being executable.  Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack.  Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.

LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO.  --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.

While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.

Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMakefile: link with -z noexecstack --no-warn-rwx-segments
Nick Desaulniers [Wed, 10 Aug 2022 22:24:40 +0000 (15:24 -0700)]
Makefile: link with -z noexecstack --no-warn-rwx-segments

Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:

  ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
  ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  ld: warning: vmlinux has a LOAD segment with RWX permissions

Generally, we would like to avoid the stack being executable.  Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack.  Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.

LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO.  --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.

While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.

Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agocrypto: blake2b: effectively disable frame size warning
Linus Torvalds [Thu, 11 Aug 2022 00:59:11 +0000 (17:59 -0700)]
crypto: blake2b: effectively disable frame size warning

It turns out that gcc-12.1 has some nasty problems with register
allocation on a 32-bit x86 build for the 64-bit values used in the
generic blake2b implementation, where the pattern of 64-bit rotates and
xor operations ends up making gcc generate horrible code.

As a result it ends up with a ridiculously large stack frame for all the
spills it generates, resulting in the following build problem:

    crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’:
    crypto/blake2b_generic.c:109:1: error: the frame size of 2640 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

on the same test-case, clang ends up generating a stack frame that is
just 296 bytes (and older gcc versions generate a slightly bigger one at
428 bytes - still nowhere near that almost 3kB monster stack frame of
gcc-12.1).

The issue is fixed both in mainline and the GCC 12 release branch [1],
but current release compilers end up failing the i386 allmodconfig build
due to this issue.

Disable the warning for now by simply raising the frame size for this
one file, just to keep this issue from having people turn off WERROR.

Link: https://lore.kernel.org/all/CAHk-=wjxqgeG2op+=W9sqgsWqCYnavC+SRfVyopu9-31S6xw+Q@mail.gmail.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 years agoMerge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Wed, 10 Aug 2022 21:04:32 +0000 (14:04 -0700)]
Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - pNFS/flexfiles: Fix infinite looping when the RDMA connection
     errors out

  Bugfixes:
   - NFS: fix port value parsing
   - SUNRPC: Reinitialise the backchannel request buffers before reuse
   - SUNRPC: fix expiry of auth creds
   - NFSv4: Fix races in the legacy idmapper upcall
   - NFS: O_DIRECT fixes from Jeff Layton
   - NFSv4.1: Fix OP_SEQUENCE error handling
   - SUNRPC: Fix an RPC/RDMA performance regression
   - NFS: Fix case insensitive renames
   - NFSv4/pnfs: Fix a use-after-free bug in open
   - NFSv4.1: RECLAIM_COMPLETE must handle EACCES

  Features:
   - NFSv4.1: session trunking enhancements
   - NFSv4.2: READ_PLUS performance optimisations
   - NFS: relax the rules for rsize/wsize mount options
   - NFS: don't unhash dentry during unlink/rename
   - SUNRPC: Fail faster on bad verifier
   - NFS/SUNRPC: Various tracing improvements"

* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
  NFS: Improve readpage/writepage tracing
  NFS: Improve O_DIRECT tracing
  NFS: Improve write error tracing
  NFS: don't unhash dentry during unlink/rename
  NFSv4/pnfs: Fix a use-after-free bug in open
  NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
  SUNRPC: Don't reuse bvec on retransmission of the request
  SUNRPC: Reinitialise the backchannel request buffers before reuse
  NFSv4.1: RECLAIM_COMPLETE must handle EACCES
  NFSv4.1 probe offline transports for trunking on session creation
  SUNRPC create a function that probes only offline transports
  SUNRPC export xprt_iter_rewind function
  SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
  NFSv4.1 remove xprt from xprt_switch if session trunking test fails
  SUNRPC create an rpc function that allows xprt removal from rpc_clnt
  SUNRPC enable back offline transports in trunking discovery
  SUNRPC create an iterator to list only OFFLINE xprts
  NFSv4.1 offline trunkable transports on DESTROY_SESSION
  SUNRPC add function to offline remove trunkable transports
  SUNRPC expose functions for offline remote xprt functionality
  ...

2 years agoKVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh
Sean Christopherson [Wed, 27 Jul 2022 23:34:24 +0000 (23:34 +0000)]
KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh

Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written
by host userspace, zero out the number of LBR records for a vCPU during
PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of
handling the check at run-time.

guest_cpuid_has() is expensive due to the linear search of guest CPUID
entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_
simply enumerating the same "Model" as the host causes KVM to set the
number of LBR records to a non-zero value.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers
Sean Christopherson [Wed, 27 Jul 2022 23:34:23 +0000 (23:34 +0000)]
KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers

Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order
to provide type safety, to document exactly what they return, and to
allow consuming the helpers in vmx.h.  Move the definitions as necessary
(the macros "reference" to_vmx() before its definition).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES
Sean Christopherson [Wed, 27 Jul 2022 23:34:22 +0000 (23:34 +0000)]
KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES

Refresh the PMU if userspace modifies MSR_IA32_PERF_CAPABILITIES.  KVM
consumes the vCPU's PERF_CAPABILITIES when enumerating PEBS support, but
relies on CPUID updates to refresh the PMU.  I.e. KVM will do the wrong
thing if userspace stuffs PERF_CAPABILITIES _after_ setting guest CPUID.

Opportunistically fix a curly-brace indentation.

Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals
Sean Christopherson [Thu, 4 Aug 2022 19:18:15 +0000 (12:18 -0700)]
KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals

Test all possible input values to verify that KVM rejects all values
except the exact host value.  Due to the LBR format affecting the core
functionality of LBRs, KVM can't emulate "other" formats, so even though
there are a variety of legal values, KVM should reject anything but an
exact host match.

Suggested-by: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test
Gavin Shan [Wed, 10 Aug 2022 10:41:14 +0000 (18:41 +0800)]
KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test

sched_getcpu() is glibc dependent and it can simply return the CPU
ID from the registered rseq information, as Florian Weimer pointed.
In this case, it's pointless to compare the return value from
sched_getcpu() and that fetched from the registered rseq information.

Fix the issue by replacing sched_getcpu() with getcpu(), as Florian
suggested. The comments are modified accordingly by replacing
"sched_getcpu()" with "getcpu()".

Reported-by: Yihuang Yu <yihyu@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Message-Id: <20220810104114.6838-3-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Make rseq compatible with glibc-2.35
Gavin Shan [Wed, 10 Aug 2022 10:41:13 +0000 (18:41 +0800)]
KVM: selftests: Make rseq compatible with glibc-2.35

The rseq information is registered by TLS, starting from glibc-2.35.
In this case, the test always fails due to syscall(__NR_rseq). For
example, on RHEL9.1 where upstream glibc-2.35 features are enabled
on downstream glibc-2.34, the test fails like below.

  # ./rseq_test
  ==== Test Assertion Failure ====
    rseq_test.c:60: !r
    pid=112043 tid=112043 errno=22 - Invalid argument
       1        0x0000000000401973: main at rseq_test.c:226
       2        0x0000ffff84b6c79b: ?? ??:0
       3        0x0000ffff84b6c86b: ?? ??:0
       4        0x0000000000401b6f: _start at ??:?
    rseq failed, errno = 22 (Invalid argument)
  # rpm -aq | grep glibc-2
  glibc-2.34-39.el9.aarch64

Fix the issue by using "../rseq/rseq.c" to fetch the rseq information,
registred by TLS if it exists. Otherwise, we're going to register our
own rseq information as before.

Reported-by: Yihuang Yu <yihyu@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Message-Id: <20220810104114.6838-2-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Actually create debugfs in kvm_create_vm()
Oliver Upton [Wed, 20 Jul 2022 09:22:51 +0000 (09:22 +0000)]
KVM: Actually create debugfs in kvm_create_vm()

Doing debugfs creation after vm creation leaves things in a
quasi-initialized state for a while. This is further complicated by the
fact that we tear down debugfs from kvm_destroy_vm(). Align debugfs and
stats init/destroy with the vm init/destroy pattern to avoid any
headaches.

Note the fix for a benign mistake in error handling for calls to
kvm_arch_create_vm_debugfs() rolled in. Since all implementations of
the function return 0 unconditionally it isn't actually a bug at
the moment.

Lastly, tear down debugfs/stats data in the kvm_create_vm_debugfs()
error path. Previously it was safe to assume that kvm_destroy_vm() would
take out the garbage, that is no longer the case.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Pass the name of the VM fd to kvm_create_vm_debugfs()
Oliver Upton [Wed, 20 Jul 2022 09:22:50 +0000 (09:22 +0000)]
KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()

At the time the VM fd is used in kvm_create_vm_debugfs(), the fd has
been allocated but not yet installed. It is only really useful as an
identifier in strings for the VM (such as debugfs).

Treat it exactly as such by passing the string name of the fd to
kvm_create_vm_debugfs(), futureproofing against possible misuse of the
VM fd.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-5-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Get an fd before creating the VM
Oliver Upton [Wed, 20 Jul 2022 09:22:49 +0000 (09:22 +0000)]
KVM: Get an fd before creating the VM

Allocate a VM's fd at the very beginning of kvm_dev_ioctl_create_vm() so
that KVM can use the fd value to generate strigns, e.g. for debugfs,
when creating and initializing the VM.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Shove vcpu stats_id init into kvm_vcpu_init()
Oliver Upton [Wed, 20 Jul 2022 09:22:48 +0000 (09:22 +0000)]
KVM: Shove vcpu stats_id init into kvm_vcpu_init()

Initialize stats_id alongside other kvm_vcpu fields to make it more
difficult to unintentionally access stats_id before it's set.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-3-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Shove vm stats_id init into kvm_create_vm()
Oliver Upton [Wed, 20 Jul 2022 09:22:47 +0000 (09:22 +0000)]
KVM: Shove vm stats_id init into kvm_create_vm()

Initialize stats_id alongside other struct kvm fields to make it more
difficult to unintentionally access stats_id before it's set.  While at
it, move the format string to the first line of the call and fix the
indentation of the second line.

No functional change intended.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220720092259.3491733-2-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen
Sean Christopherson [Fri, 5 Aug 2022 19:41:33 +0000 (19:41 +0000)]
KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen

Add compile-time and init-time sanity checks to ensure that the MMIO SPTE
mask doesn't overlap the MMIO SPTE generation or the MMU-present bit.
The generation currently avoids using bit 63, but that's as much
coincidence as it is strictly necessarly.  That will change in the future,
as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask.

Explicitly carve out the bits that are allowed in the mask so that any
future shuffling of SPTE bits doesn't silently break MMIO caching (KVM
has broken MMIO caching more than once due to overlapping the generation
with other things).

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220805194133.86299-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/mmu: rename trace function name for asynchronous page fault
Mingwei Zhang [Sun, 7 Aug 2022 05:21:41 +0000 (05:21 +0000)]
KVM: x86/mmu: rename trace function name for asynchronous page fault

Rename the tracepoint function from trace_kvm_async_pf_doublefault() to
trace_kvm_async_pf_repeated_fault() to make it clear, since double fault
has nothing to do with this trace function.

Asynchronous Page Fault (APF) is an artifact generated by KVM when it
cannot find a physical page to satisfy an EPT violation. KVM uses APF to
tell the guest OS to do something else such as scheduling other guest
processes to make forward progress. However, when another guest process
also touches a previously APFed page, KVM halts the vCPU instead of
generating a repeated APF to avoid wasting cycles.

Double fault (#DF) clearly has a different meaning and a different
consequence when triggered. #DF requires two nested contributory exceptions
instead of two page faults faulting at the same address. A prevous bug on
APF indicates that it may trigger a double fault in the guest [1] and
clearly this trace function has nothing to do with it. So rename this
function should be a valid choice.

No functional change intended.

[1] https://www.spinics.net/lists/kvm/msg214957.html

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220807052141.69186-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Stop Xen timer before changing IRQ
Coleman Dietsch [Mon, 8 Aug 2022 19:06:07 +0000 (14:06 -0500)]
KVM: x86/xen: Stop Xen timer before changing IRQ

Stop Xen timer (if it's running) prior to changing the IRQ vector and
potentially (re)starting the timer. Changing the IRQ vector while the
timer is still running can result in KVM injecting a garbage event, e.g.
vm_xen_inject_timer_irqs() could see a non-zero xen.timer_pending from
a previous timer but inject the new xen.timer_virq.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220808190607.323899-3-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Initialize Xen timer only once
Coleman Dietsch [Mon, 8 Aug 2022 19:06:06 +0000 (14:06 -0500)]
KVM: x86/xen: Initialize Xen timer only once

Add a check for existing xen timers before initializing a new one.

Currently kvm_xen_init_timer() is called on every
KVM_XEN_VCPU_ATTR_TYPE_TIMER, which is causing the following ODEBUG
crash when vcpu->arch.xen.timer is already set.

ODEBUG: init active (active state 0)
object type: hrtimer hint: xen_timer_callbac0
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:502
Call Trace:
__debug_object_init
debug_hrtimer_init
debug_init
hrtimer_init
kvm_xen_init_timer
kvm_xen_vcpu_set_attr
kvm_arch_vcpu_ioctl
kvm_vcpu_ioctl
vfs_ioctl

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220808190607.323899-2-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: SVM: Disable SEV-ES support if MMIO caching is disable
Sean Christopherson [Wed, 3 Aug 2022 22:49:57 +0000 (22:49 +0000)]
KVM: SVM: Disable SEV-ES support if MMIO caching is disable

Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs
generating #NPF(RSVD), which are reflected by the CPU into the guest as
a #VC.  With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access
to the guest instruction stream or register state and so can't directly
emulate in response to a #NPF on an emulated MMIO GPA.  Disabling MMIO
caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT),
and those flavors of #NPF cause automatic VM-Exits, not #VC.

Adjust KVM's MMIO masks to account for the C-bit location prior to doing
SEV(-ES) setup, and document that dependency between adjusting the MMIO
SPTE mask and SEV(-ES) setup.

Fixes: b09763da4dd8 ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)")
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change
Sean Christopherson [Wed, 3 Aug 2022 22:49:56 +0000 (22:49 +0000)]
KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change

Fully re-evaluate whether or not MMIO caching can be enabled when SPTE
masks change; simply clearing enable_mmio_caching when a configuration
isn't compatible with caching fails to handle the scenario where the
masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit
location, and toggle compatibility from false=>true.

Snapshot the original module param so that re-evaluating MMIO caching
preserves userspace's desire to allow caching.  Use a snapshot approach
so that enable_mmio_caching still reflects KVM's actual behavior.

Fixes: 8b9e74bfbf8c ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled")
Reported-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220803224957.1285926-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: Tag kvm_mmu_x86_module_init() with __init
Sean Christopherson [Wed, 3 Aug 2022 22:49:55 +0000 (22:49 +0000)]
KVM: x86: Tag kvm_mmu_x86_module_init() with __init

Mark kvm_mmu_x86_module_init() with __init, the entire reason it exists
is to initialize variables when kvm.ko is loaded, i.e. it must never be
called after module initialization.

Fixes: 1d0e84806047 ("KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: emulator: Fix illegal LEA handling
Michal Luczaj [Fri, 29 Jul 2022 13:48:01 +0000 (15:48 +0200)]
KVM: x86: emulator: Fix illegal LEA handling

The emulator mishandles LEA with register source operand. Even though such
LEA is illegal, it can be encoded and fed to CPU. In which case real
hardware throws #UD. The emulator, instead, returns address of
x86_emulate_ctxt._regs. This info leak hurts host's kASLR.

Tell the decoder that illegal LEA is not to be emulated.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20220729134801.1120-1-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: X86: avoid uninitialized 'fault.async_page_fault' from fixed-up #PF
Yu Zhang [Mon, 18 Jul 2022 07:47:56 +0000 (15:47 +0800)]
KVM: X86: avoid uninitialized 'fault.async_page_fault' from fixed-up #PF

kvm_fixup_and_inject_pf_error() was introduced to fixup the error code(
e.g., to add RSVD flag) and inject the #PF to the guest, when guest
MAXPHYADDR is smaller than the host one.

When it comes to nested, L0 is expected to intercept and fix up the #PF
and then inject to L2 directly if
- L2.MAXPHYADDR < L0.MAXPHYADDR and
- L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
  same MAXPHYADDR value && L1 is using EPT for L2),
instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
may happen on all L2 #PFs.

However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
may cause the fault.async_page_fault being NOT zeroed, and later the #PF
being treated as a nested async page fault, and then being injected to L1.
Instead of zeroing 'fault' at the beginning of this function, we mannually
set the value of 'fault.async_page_fault', because false is the value we
really expect.

Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220718074756.53788-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: Bug the VM if an accelerated x2APIC trap occurs on a "bad" reg
Sean Christopherson [Thu, 4 Aug 2022 23:50:28 +0000 (23:50 +0000)]
KVM: x86: Bug the VM if an accelerated x2APIC trap occurs on a "bad" reg

Bug the VM if retrieving the x2APIC MSR/register while processing an
accelerated vAPIC trap VM-Exit fails.  In theory it's impossible for the
lookup to fail as hardware has already validated the register, but bugs
happen, and not checking the result of kvm_lapic_msr_read() would result
in consuming the uninitialized "val" if a KVM or hardware bug occurs.

Fixes: 1bd9dfec9fd4 ("KVM: x86: Do not block APIC write for non ICR registers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220804235028.1766253-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: do not report preemption if the steal time cache is stale
Paolo Bonzini [Thu, 4 Aug 2022 13:28:32 +0000 (15:28 +0200)]
KVM: x86: do not report preemption if the steal time cache is stale

Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR.  This cause an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same.  This can happen with kexec, in which case the preempted
bit is written at the address used by the old kernel instead of
the old one.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: revalidate steal time cache if MSR value changes
Paolo Bonzini [Thu, 4 Aug 2022 13:28:32 +0000 (15:28 +0200)]
KVM: x86: revalidate steal time cache if MSR value changes

Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR.  This cause an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same.  This can happen with kexec, in which case the steal
time data is written at the address used by the old kernel instead of
the old one.

While at it, rename the variable from gfn to gpa since it is a plain
physical address and not a right-shifted one.

Reported-by: Dave Young <ruyang@redhat.com>
Reported-by: Xiaoying Yan <yiyan@redhat.com>
Analyzed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoselftests: kvm: fix compilation
Paolo Bonzini [Wed, 10 Aug 2022 18:55:27 +0000 (14:55 -0400)]
selftests: kvm: fix compilation

Commit  49de12ba06ef ("selftests: drop KSFT_KHDR_INSTALL make target")
dropped from tools/testing/selftests/lib.mk the code related to KSFT_KHDR_INSTALL,
but in doing so it also dropped the definition of the ARCH variable.  The ARCH
variable is used in several subdirectories, but kvm/ is the only one of these
that was using KSFT_KHDR_INSTALL.

As a result, kvm selftests cannot be built anymore:

In file included from include/x86_64/vmx.h:12,
                 from x86_64/vmx_pmu_caps_test.c:18:
include/x86_64/processor.h:15:10: fatal error: asm/msr-index.h: No such file or directory
   15 | #include <asm/msr-index.h>
      |          ^~~~~~~~~~~~~~~~~

In file included from ../../../../tools/include/asm/atomic.h:6,
                 from ../../../../tools/include/linux/atomic.h:5,
                 from rseq_test.c:15:
../../../../tools/include/asm/../../arch/x86/include/asm/atomic.h:11:10: fatal error: asm/cmpxchg.h: No such file or directory
   11 | #include <asm/cmpxchg.h>
      |          ^~~~~~~~~~~~~~~

Fix it by including the definition that was present in lib.mk.

Fixes: 49de12ba06ef ("selftests: drop KSFT_KHDR_INSTALL make target")
Cc: Guillaume Tucker <guillaume.tucker@collabora.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoMerge tag 'hwmon-fixes-for-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 10 Aug 2022 18:30:16 +0000 (11:30 -0700)]
Merge tag 'hwmon-fixes-for-v6.0-rc1' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
 "Fix two regressions in nct6775 and lm90 drivers"

* tag 'hwmon-fixes-for-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (nct6775) Fix platform driver suspend regression
  hwmon: (lm90) Fix error return value from detect function

2 years agoMerge tag 'rpmsg-v5.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc...
Linus Torvalds [Wed, 10 Aug 2022 18:28:14 +0000 (11:28 -0700)]
Merge tag 'rpmsg-v5.20-1' of git://git./linux/kernel/git/remoteproc/linux

Pull rpmsg fixes from Bjorn Andersson:
 "This fixes schema validation warnings in the Devicetree bindings for
  SMD and SMD RPM"

* tag 'rpmsg-v5.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
  dt-bindings: soc: qcom: smd-rpm: extend example
  dt-bindings: soc: qcom: smd: reference SMD edge schema

2 years agoMerge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 10 Aug 2022 18:18:00 +0000 (11:18 -0700)]
Merge tag 'mm-stable-2022-08-09' of git://git./linux/kernel/git/akpm/mm

Pull remaining MM updates from Andrew Morton:
 "Three patch series - two that perform cleanups and one feature:

   - hugetlb_vmemmap cleanups from Muchun Song

   - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi

   - highmem documentation fixups from Fabio De Francesco"

* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
  Documentation/mm: add details about kmap_local_page() and preemption
  highmem: delete a sentence from kmap_local_page() kdocs
  Documentation/mm: rrefer kmap_local_page() and avoid kmap()
  Documentation/mm: avoid invalid use of addresses from kmap_local_page()
  Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
  highmem: specify that kmap_local_page() is callable from interrupts
  highmem: remove unneeded spaces in kmap_local_page() kdocs
  mm, hwpoison: enable memory error handling on 1GB hugepage
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
  mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  mm: hugetlb_vmemmap: replace early_param() with core_param()
  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  ...

2 years agoMerge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Linus Torvalds [Wed, 10 Aug 2022 18:07:26 +0000 (11:07 -0700)]
Merge tag 'cxl-for-6.0' of git://git./linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
 "Compute Express Link (CXL) updates for 6.0:

   - Introduce a 'struct cxl_region' object with support for
     provisioning and assembling persistent memory regions.

   - Introduce alloc_free_mem_region() to accompany the existing
     request_free_mem_region() as a method to allocate physical memory
     capacity out of an existing resource.

   - Export insert_resource_expand_to_fit() for the CXL subsystem to
     late-publish CXL platform windows in iomem_resource.

   - Add a polled mode PCI DOE (Data Object Exchange) driver service and
     use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
     Table)"

* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
  cxl/hdm: Fix skip allocations vs multiple pmem allocations
  cxl/region: Disallow region granularity != window granularity
  cxl/region: Fix x1 interleave to greater than x1 interleave routing
  cxl/region: Move HPA setup to cxl_region_attach()
  cxl/region: Fix decoder interleave programming
  Documentation: cxl: remove dangling kernel-doc reference
  cxl/region: describe targets and nr_targets members of cxl_region_params
  cxl/regions: add padding for cxl_rr_ep_add nested lists
  cxl/region: Fix IS_ERR() vs NULL check
  cxl/region: Fix region reference target accounting
  cxl/region: Fix region commit uninitialized variable warning
  cxl/region: Fix port setup uninitialized variable warnings
  cxl/region: Stop initializing interleave granularity
  cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
  cxl/acpi: Minimize granularity for x1 interleaves
  cxl/region: Delete 'region' attribute from root decoders
  cxl/acpi: Autoload driver for 'cxl_acpi' test devices
  cxl/region: decrement ->nr_targets on error in cxl_region_attach()
  cxl/region: prevent underflow in ways_to_cxl()
  cxl/region: uninitialized variable in alloc_hpa()
  ...

2 years agoMerge tag 'pinctrl-v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 10 Aug 2022 18:01:44 +0000 (11:01 -0700)]
Merge tag 'pinctrl-v6.0-1' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
 "Outside the pinctrl driver and DT bindings we hit some Arm DT files,
  patched by the maintainers.

  Other than that it is business as usual.

  Core changes:

   - Add PINCTRL_PINGROUP() helper macro (and use it in the AMD driver).

  New drivers:

   - Intel Meteor Lake support.

   - Reneasas RZ/V2M and r8a779g0 (R-Car V4H).

   - AXP209 variants AXP221, AXP223 and AXP809.

   - Qualcomm MSM8909, PM8226, PMP8074 and SM6375.

   - Allwinner D1.

  Improvements:

   - Proper pin multiplexing in the AMD driver.

   - Mediatek MT8192 can use generic drive strength and pin bias, then
     fixes on top plus some I2C pin group fixes.

   - Have the Allwinner Sunplus SP7021 use the generic DT schema and
     make interrupts optional.

   - Handle Qualcomm SC7280 ADSP.

   - Handle Qualcomm MSM8916 CAMSS GP clock muxing.

   - High impedance bias on ZynqMP.

   - Serialize StarFive access to MMIO.

   - Immutable gpiochip for BCM2835, Ingenic, Qualcomm SPMI GPIO"

* tag 'pinctrl-v6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (117 commits)
  dt-bindings: pinctrl: qcom,pmic-gpio: add PM8226 constraints
  pinctrl: qcom: Make PINCTRL_SM8450 depend on PINCTRL_MSM
  pinctrl: qcom: sm8250: Fix PDC map
  pinctrl: amd: Fix an unused variable
  dt-bindings: pinctrl: mt8186: Add and use drive-strength-microamp
  dt-bindings: pinctrl: mt8186: Add gpio-line-names property
  ARM: dts: imxrt1170-pinfunc: Add pinctrl binding header
  pinctrl: amd: Use unicode for debugfs output
  pinctrl: amd: Fix newline declaration in debugfs output
  pinctrl: at91: Fix typo 'the the' in comment
  dt-bindings: pinctrl: st,stm32: Correct 'resets' property name
  pinctrl: mvebu: Missing a blank line after declarations.
  pinctrl: qcom: Add SM6375 TLMM driver
  dt-bindings: pinctrl: Add DT schema for SM6375 TLMM
  dt-bindings: pinctrl: mt8195: Use drive-strength-microamp in examples
  Revert "pinctrl: qcom: spmi-gpio: make the irqchip immutable"
  pinctrl: imx93: Add MODULE_DEVICE_TABLE()
  pinctrl: sunxi: Add driver for Allwinner D1
  pinctrl: sunxi: Make some layout parameters dynamic
  pinctrl: sunxi: Refactor register/offset calculation
  ...

2 years agoMerge tag 'apparmor-pr-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 10 Aug 2022 17:53:22 +0000 (10:53 -0700)]
Merge tag 'apparmor-pr-2022-08-08' of git://git./linux/kernel/git/jj/linux-apparmor

Pull AppArmor updates from John Johansen:
 "This is mostly cleanups and bug fixes with the one bigger change being
  Mathew Wilcox's patch to use XArrays instead of the IDR from the
  thread around the locking weirdness.

  Features:
   - Convert secid mapping to XArrays instead of IDR
   - Add a kernel label to use on kernel objects
   - Extend policydb permission set by making use of the xbits
   - Make export of raw binary profile to userspace optional
   - Enable tuning of policy paranoid load for embedded systems
   - Don't create raw_sha1 symlink if sha1 hashing is disabled
   - Allow labels to carry debug flags

  Cleanups:
   - Update MAINTAINERS file
   - Use struct_size() helper in kmalloc()
   - Move ptrace mediation to more logical task.{h,c}
   - Resolve uninitialized symbol warnings
   - Remove redundant ret variable
   - Mark alloc_unconfined() as static
   - Update help description of policy hash for introspection
   - Remove some casts which are no-longer required

  Bug Fixes:
   - Fix aa_label_asxprint return check
   - Fix reference count leak in aa_pivotroot()
   - Fix memleak in aa_simple_write_to_buffer()
   - Fix kernel doc comments
   - Fix absroot causing audited secids to begin with =
   - Fix quiet_denied for file rules
   - Fix failed mount permission check error message
   - Disable showing the mode as part of a secid to secctx
   - Fix setting unconfined mode on a loaded profile
   - Fix overlapping attachment computation
   - Fix undefined reference to `zlib_deflate_workspacesize'"

* tag 'apparmor-pr-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor: (34 commits)
  apparmor: Update MAINTAINERS file with new email address
  apparmor: correct config reference to intended one
  apparmor: move ptrace mediation to more logical task.{h,c}
  apparmor: extend policydb permission set by making use of the xbits
  apparmor: allow label to carry debug flags
  apparmor: fix overlapping attachment computation
  apparmor: fix setting unconfined mode on a loaded profile
  apparmor: Fix some kernel-doc comments
  apparmor: Mark alloc_unconfined() as static
  apparmor: disable showing the mode as part of a secid to secctx
  apparmor: Convert secid mapping to XArrays instead of IDR
  apparmor: add a kernel label to use on kernel objects
  apparmor: test: Remove some casts which are no-longer required
  apparmor: Fix memleak in aa_simple_write_to_buffer()
  apparmor: fix reference count leak in aa_pivotroot()
  apparmor: Fix some kernel-doc comments
  apparmor: Fix undefined reference to `zlib_deflate_workspacesize'
  apparmor: fix aa_label_asxprint return check
  apparmor: Fix some kernel-doc comments
  apparmor: Fix some kernel-doc comments
  ...

2 years agoMerge tag 'kbuild-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy...
Linus Torvalds [Wed, 10 Aug 2022 17:40:41 +0000 (10:40 -0700)]
Merge tag 'kbuild-v5.20' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Remove the support for -O3 (CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3)

 - Fix error of rpm-pkg cross-builds

 - Support riscv for checkstack tool

 - Re-enable -Wformwat warnings for Clang

 - Clean up modpost, Makefiles, and misc scripts

* tag 'kbuild-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (30 commits)
  modpost: remove .symbol_white_list field entirely
  modpost: remove unneeded .symbol_white_list initializers
  modpost: add PATTERNS() helper macro
  modpost: shorten warning messages in report_sec_mismatch()
  Revert "Kbuild, lto, workaround: Don't warn for initcall_reference in modpost"
  modpost: use more reliable way to get fromsec in section_rel(a)()
  modpost: add array range check to sec_name()
  modpost: refactor get_secindex()
  kbuild: set EXIT trap before creating temporary directory
  modpost: remove unused Elf_Sword macro
  Makefile.extrawarn: re-enable -Wformat for clang
  kbuild: add dtbs_prepare target
  kconfig: Qt5: tell the user which packages are required
  modpost: use sym_get_data() to get module device_table data
  modpost: drop executable ELF support
  checkstack: add riscv support for scripts/checkstack.pl
  kconfig: shorten the temporary directory name for cc-option
  scripts: headers_install.sh: Update config leak ignore entries
  kbuild: error out if $(INSTALL_MOD_PATH) contains % or :
  kbuild: error out if $(KBUILD_EXTMOD) contains % or :
  ...

2 years agoMerge branch 'fixes for bpf map iterator'
Alexei Starovoitov [Wed, 10 Aug 2022 17:12:48 +0000 (10:12 -0700)]
Merge branch 'fixes for bpf map iterator'

Hou Tao says:

====================

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset constitues three fixes for bpf map iterator:

(1) patch 1~4: fix user-after-free during reading map iterator fd
It is possible when both the corresponding link fd and map fd are
closed bfore reading the iterator fd. I had squashed these four patches
into one, but it was not friendly for stable backport, so I break these
fixes into four single patches in the end. Patch 7 is its testing patch.

(2) patch 5: fix invalidity check for values in sk local storage map
Patch 8 adds two tests for it.

(3) patch 6: reject sleepable program for non-resched map iterator
Patch 9 add a test for it.

Please check the individual patches for more details. And comments are
always welcome.

Regards,
Tao

Changes since v2:
* patch 1~6: update commit messages (from Yonghong & Martin)
* patch 7: add more detailed comments (from Yonghong)
* patch 8: use NULL directly instead of (void *)0

v1: https://lore.kernel.org/bpf/20220806074019.2756957-1-houtao@huaweicloud.com
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agoselftests/bpf: Ensure sleepable program is rejected by hash map iter
Hou Tao [Wed, 10 Aug 2022 08:05:38 +0000 (16:05 +0800)]
selftests/bpf: Ensure sleepable program is rejected by hash map iter

Add a test to ensure sleepable program is rejected by hash map iterator.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-10-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agoselftests/bpf: Add write tests for sk local storage map iterator
Hou Tao [Wed, 10 Aug 2022 08:05:37 +0000 (16:05 +0800)]
selftests/bpf: Add write tests for sk local storage map iterator

Add test to validate the overwrite of sock local storage map value in
map iterator and another one to ensure out-of-bound value writing is
rejected.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-9-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agoselftests/bpf: Add tests for reading a dangling map iter fd
Hou Tao [Wed, 10 Aug 2022 08:05:36 +0000 (16:05 +0800)]
selftests/bpf: Add tests for reading a dangling map iter fd

After closing both related link fd and map fd, reading the map
iterator fd to ensure it is OK to do so.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Only allow sleepable program for resched-able iterator
Hou Tao [Wed, 10 Aug 2022 08:05:35 +0000 (16:05 +0800)]
bpf: Only allow sleepable program for resched-able iterator

When a sleepable program is attached to a hash map iterator, might_fault()
will report "BUG: sleeping function called from invalid context..." if
CONFIG_DEBUG_ATOMIC_SLEEP is enabled. The reason is that rcu_read_lock()
is held in bpf_hash_map_seq_next() and won't be released until all elements
are traversed or bpf_hash_map_seq_stop() is called.

Fixing it by reusing BPF_ITER_RESCHED to indicate that only non-sleepable
program is allowed for iterator without BPF_ITER_RESCHED. We can revise
bpf_iter_link_attach() later if there are other conditions which may
cause rcu_read_lock() or spin_lock() issues.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Check the validity of max_rdwr_access for sock local storage map iterator
Hou Tao [Wed, 10 Aug 2022 08:05:34 +0000 (16:05 +0800)]
bpf: Check the validity of max_rdwr_access for sock local storage map iterator

The value of sock local storage map is writable in map iterator, so check
max_rdwr_access instead of max_rdonly_access.

Fixes: 5ce6e77c7edf ("bpf: Implement bpf iterator for sock local storage map")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
Hou Tao [Wed, 10 Aug 2022 08:05:33 +0000 (16:05 +0800)]
bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator

sock_map_iter_attach_target() acquires a map uref, and the uref may be
released before or in the middle of iterating map elements. For example,
the uref could be released in sock_map_iter_detach_target() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

Fixing it by acquiring an extra map uref in .init_seq_private and
releasing it in .fini_seq_private.

Fixes: 0365351524d7 ("net: Allow iterating sockmap and sockhash")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Acquire map uref in .init_seq_private for sock local storage map iterator
Hou Tao [Wed, 10 Aug 2022 08:05:32 +0000 (16:05 +0800)]
bpf: Acquire map uref in .init_seq_private for sock local storage map iterator

bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

So acquiring an extra map uref in bpf_iter_init_sk_storage_map() and
releasing it in bpf_iter_fini_sk_storage_map().

Fixes: 5ce6e77c7edf ("bpf: Implement bpf iterator for sock local storage map")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Acquire map uref in .init_seq_private for hash map iterator
Hou Tao [Wed, 10 Aug 2022 08:05:31 +0000 (16:05 +0800)]
bpf: Acquire map uref in .init_seq_private for hash map iterator

bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

So acquiring an extra map uref in bpf_iter_init_hash_map() and
releasing it in bpf_iter_fini_hash_map().

Fixes: d6c4503cc296 ("bpf: Implement bpf iterator for hash maps")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Acquire map uref in .init_seq_private for array map iterator
Hou Tao [Wed, 10 Aug 2022 08:05:30 +0000 (16:05 +0800)]
bpf: Acquire map uref in .init_seq_private for array map iterator

bpf_iter_attach_map() acquires a map uref, and the uref may be released
before or in the middle of iterating map elements. For example, the uref
could be released in bpf_iter_detach_map() as part of
bpf_link_release(), or could be released in bpf_map_put_with_uref() as
part of bpf_map_release().

Alternative fix is acquiring an extra bpf_link reference just like
a pinned map iterator does, but it introduces unnecessary dependency
on bpf_link instead of bpf_map.

So choose another fix: acquiring an extra map uref in .init_seq_private
for array map iterator.

Fixes: d3cc2ab546ad ("bpf: Implement bpf iterator for array maps")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220810080538.1845898-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf: Disallow bpf programs call prog_run command.
Alexei Starovoitov [Tue, 9 Aug 2022 03:58:09 +0000 (20:58 -0700)]
bpf: Disallow bpf programs call prog_run command.

The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in
pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN
command from within the program.
To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf()
kernel function that can only be used by the kernel light skeleton directly.

Reported-by: YiFei Zhu <zhuyifei@google.com>
Fixes: b1d18a7574d0 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 years agobpf, arm64: Fix bpf trampoline instruction endianness
Xu Kuohai [Mon, 8 Aug 2022 04:07:35 +0000 (00:07 -0400)]
bpf, arm64: Fix bpf trampoline instruction endianness

The sparse tool complains as follows:

arch/arm64/net/bpf_jit_comp.c:1684:16:
warning: incorrect type in assignment (different base types)
arch/arm64/net/bpf_jit_comp.c:1684:16:
expected unsigned int [usertype] *branch
arch/arm64/net/bpf_jit_comp.c:1684:16:
got restricted __le32 [usertype] *
arch/arm64/net/bpf_jit_comp.c:1700:52:
error: subtraction of different types can't work (different base
types)
arch/arm64/net/bpf_jit_comp.c:1734:29:
warning: incorrect type in assignment (different base types)
arch/arm64/net/bpf_jit_comp.c:1734:29:
expected unsigned int [usertype] *
arch/arm64/net/bpf_jit_comp.c:1734:29:
got restricted __le32 [usertype] *
arch/arm64/net/bpf_jit_comp.c:1918:52:
error: subtraction of different types can't work (different base
types)

This is because the variable branch in function invoke_bpf_prog and the
variable branches in function prepare_trampoline are defined as type
u32 *, which conflicts with ctx->image's type __le32 *, so sparse complains
when assignment or arithmetic operation are performed on these two
variables and ctx->image.

Since arm64 instructions are always little-endian, change the type of
these two variables to __le32 * and call cpu_to_le32() to convert
instruction to little-endian before writing it to memory. This is also
in line with emit() which internally does cpu_to_le32(), too.

Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/bpf/20220808040735.1232002-1-xukuohai@huawei.com
2 years agohwmon: (nct6775) Fix platform driver suspend regression
Zev Weiss [Wed, 10 Aug 2022 05:26:46 +0000 (22:26 -0700)]
hwmon: (nct6775) Fix platform driver suspend regression

Commit c3963bc0a0cf ("hwmon: (nct6775) Split core and platform
driver") introduced a slight change in nct6775_suspend() in order to
avoid an otherwise-needless symbol export for nct6775_update_device(),
replacing a call to that function with a simple dev_get_drvdata()
instead.

As it turns out, there is no guarantee that nct6775_update_device()
is ever called prior to suspend. If this happens, the resume function
ends up writing bad data into the various chip registers, which results
in a crash shortly after resume.

To fix the problem, just add the symbol export and return to using
nct6775_update_device() as was employed previously.

Reported-by: Zoltán Kővágó <dirty.ice.hu@gmail.com>
Tested-by: Zoltán Kővágó <dirty.ice.hu@gmail.com>
Fixes: c3963bc0a0cf ("hwmon: (nct6775) Split core and platform driver")
Cc: stable@kernel.org
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Link: https://lore.kernel.org/r/20220810052646.13825-1-zev@bewilderbeest.net
[groeck: Updated description]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2 years agogenetlink: correct uAPI defines
Jakub Kicinski [Tue, 9 Aug 2022 23:27:40 +0000 (16:27 -0700)]
genetlink: correct uAPI defines

Commit 50a896cf2d6f ("genetlink: properly support per-op policy dumping")
seems to have copy'n'pasted things a little incorrectly.

The #define CTRL_ATTR_MCAST_GRP_MAX should have stayed right
after the previous enum. The new CTRL_ATTR_POLICY_* needs
its own define for MAX and that max should not contain the
superfluous _DUMP in the name.

We probably can't do anything about the CTRL_ATTR_POLICY_DUMP_MAX
any more, there's likely code which uses it. For consistency
(*cough* codegen *cough*) let's add the correctly name define
nonetheless.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2 years agodevlink: Fix use-after-free after a failed reload
Ido Schimmel [Tue, 9 Aug 2022 11:35:06 +0000 (14:35 +0300)]
devlink: Fix use-after-free after a failed reload

After a failed devlink reload, devlink parameters are still registered,
which means user space can set and get their values. In the case of the
mlxsw "acl_region_rehash_interval" parameter, these operations will
trigger a use-after-free [1].

Fix this by rejecting set and get operations while in the failed state.
Return the "-EOPNOTSUPP" error code which does not abort the parameters
dump, but instead causes it to skip over the problematic parameter.

Another possible fix is to perform these checks in the mlxsw parameter
callbacks, but other drivers might be affected by the same problem and I
am not aware of scenarios where these stricter checks will cause a
regression.

[1]
mlxsw_spectrum3 0000:00:10.0: Port 125: Failed to register netdev
mlxsw_spectrum3 0000:00:10.0: Failed to create ports

==================================================================
BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904
Read of size 4 at addr ffff8880099dcfd8 by task kworker/u4:4/777

CPU: 1 PID: 777 Comm: kworker/u4:4 Not tainted 5.19.0-rc7-custom-126601-gfe26f28c586d #1
Hardware name: QEMU MSN4700, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x92/0xbd lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:313 [inline]
 print_report.cold+0x5e/0x5cf mm/kasan/report.c:429
 kasan_report+0xb9/0xf0 mm/kasan/report.c:491
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report_generic.c:306
 mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xbd/0xd0 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:904
 mlxsw_sp_acl_region_rehash_intrvl_get+0x49/0x60 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c:1106
 mlxsw_sp_params_acl_region_rehash_intrvl_get+0x33/0x80 drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3854
 devlink_param_get net/core/devlink.c:4981 [inline]
 devlink_nl_param_fill+0x238/0x12d0 net/core/devlink.c:5089
 devlink_param_notify+0xe5/0x230 net/core/devlink.c:5168
 devlink_ns_change_notify net/core/devlink.c:4417 [inline]
 devlink_ns_change_notify net/core/devlink.c:4396 [inline]
 devlink_reload+0x15f/0x700 net/core/devlink.c:4507
 devlink_pernet_pre_exit+0x112/0x1d0 net/core/devlink.c:12272
 ops_pre_exit_list net/core/net_namespace.c:152 [inline]
 cleanup_net+0x494/0xc00 net/core/net_namespace.c:582
 process_one_work+0x9fc/0x1710 kernel/workqueue.c:2289
 worker_thread+0x675/0x10b0 kernel/workqueue.c:2436
 kthread+0x30c/0x3d0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
 </TASK>

The buggy address belongs to the physical page:
page:ffffea0000267700 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x99dc
flags: 0x100000000000000(node=0|zone=1)
raw: 0100000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880099dce80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8880099dcf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff8880099dcf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
 ffff8880099dd000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8880099dd080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================

Fixes: 98bbf70c1c41 ("mlxsw: spectrum: add "acl_region_rehash_interval" devlink param")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>