platform/kernel/linux-rpi.git
14 months agoMerge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD
Paolo Bonzini [Wed, 26 Apr 2023 19:50:01 +0000 (15:50 -0400)]
Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.4:

 - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
   guest is only adding new PTEs

 - Overhaul .sync_page() and .invlpg() to share the .sync_page()
   implementation, i.e. utilize .sync_page()'s optimizations when emulating
   invalidations

 - Clean up the range-based flushing APIs

 - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry

 - Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()

 - Misc cleanups

14 months agoMerge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD
Paolo Bonzini [Wed, 26 Apr 2023 19:49:23 +0000 (15:49 -0400)]
Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 changes for 6.4:

 - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)

 - Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool

 - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

 - Misc cleanups

14 months agoMerge tag 'kvm-x86-generic-6.4' of https://github.com/kvm-x86/linux into HEAD
Paolo Bonzini [Wed, 26 Apr 2023 19:48:44 +0000 (15:48 -0400)]
Merge tag 'kvm-x86-generic-6.4' of https://github.com/kvm-x86/linux into HEAD

Common KVM changes for 6.4:

 - Drop unnecessary casts from "void *" throughout kvm_main.c

 - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
   size by 8 bytes on 64-bit kernels by utilizing a padding hole

 - Fix a documentation format goof that was introduced when the KVM docs
   were converted to ReST

 - Constify MIPS's internal callbacks (a leftover from the hardware enabling
   rework that landed in 6.3)

14 months agoMerge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm...
Paolo Bonzini [Wed, 26 Apr 2023 19:46:52 +0000 (15:46 -0400)]
Merge tag 'kvmarm-6.4' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.4

- Numerous fixes for the pathological lock inversion issue that
  plagued KVM/arm64 since... forever.

- New framework allowing SMCCC-compliant hypercalls to be forwarded
  to userspace, hopefully paving the way for some more features
  being moved to VMMs rather than be implemented in the kernel.

- Large rework of the timer code to allow a VM-wide offset to be
  applied to both virtual and physical counters as well as a
  per-timer, per-vcpu offset that complements the global one.
  This last part allows the NV timer code to be implemented on
  top.

- A small set of fixes to make sure that we don't change anything
  affecting the EL1&0 translation regime just after having having
  taken an exception to EL2 until we have executed a DSB. This
  ensures that speculative walks started in EL1&0 have completed.

- The usual selftest fixes and improvements.

14 months agoMerge tag 'kvm-s390-next-6.4-1' of https://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Wed, 26 Apr 2023 19:43:15 +0000 (15:43 -0400)]
Merge tag 'kvm-s390-next-6.4-1' of https://git./linux/kernel/git/kvms390/linux into HEAD

Minor cleanup:
 - phys_to_virt conversion
 - Improvement of VSIE AP management

15 months agoMerge branch kvm-arm64/spec-ptw into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:44:58 +0000 (09:44 +0100)]
Merge branch kvm-arm64/spec-ptw into kvmarm-master/next

* kvm-arm64/spec-ptw:
  : .
  : On taking an exception from EL1&0 to EL2(&0), the page table walker is
  : allowed to carry on with speculative walks started from EL1&0 while
  : running at EL2 (see R_LFHQG). Given that the PTW may be actively using
  : the EL1&0 system registers, the only safe way to deal with it is to
  : issue a DSB before changing any of it.
  :
  : We already did the right thing for SPE and TRBE, but ignored the PTW
  : for unknown reasons (probably because the architecture wasn't crystal
  : clear at the time).
  :
  : This requires a bit of surgery in the nvhe code, though most of these
  : patches are comments so that my future self can understand the purpose
  : of these barriers. The VHE code is largely unaffected, thanks to the
  : DSB in the context switch.
  : .
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoMerge branch kvm-arm64/smccc-filtering into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:43:38 +0000 (09:43 +0100)]
Merge branch kvm-arm64/smccc-filtering into kvmarm-master/next

* kvm-arm64/smccc-filtering:
  : .
  : SMCCC call filtering and forwarding to userspace, courtesy of
  : Oliver Upton. From the cover letter:
  :
  : "The Arm SMCCC is rather prescriptive in regards to the allocation of
  : SMCCC function ID ranges. Many of the hypercall ranges have an
  : associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
  : room for vendor-specific implementations.
  :
  : The ever-expanding SMCCC surface leaves a lot of work within KVM for
  : providing new features. Furthermore, KVM implements its own
  : vendor-specific ABI, with little room for other implementations (like
  : Hyper-V, for example). Rather than cramming it all into the kernel we
  : should provide a way for userspace to handle hypercalls."
  : .
  KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
  KVM: arm64: Test that SMC64 arch calls are reserved
  KVM: arm64: Prevent userspace from handling SMC64 arch range
  KVM: arm64: Expose SMC/HVC width to userspace
  KVM: selftests: Add test for SMCCC filter
  KVM: selftests: Add a helper for SMCCC calls with SMC instruction
  KVM: arm64: Let errors from SMCCC emulation to reach userspace
  KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
  KVM: arm64: Introduce support for userspace SMCCC filtering
  KVM: arm64: Add support for KVM_EXIT_HYPERCALL
  KVM: arm64: Use a maple tree to represent the SMCCC filter
  KVM: arm64: Refactor hvc filtering to support different actions
  KVM: arm64: Start handling SMCs from EL1
  KVM: arm64: Rename SMC/HVC call handler to reflect reality
  KVM: arm64: Add vm fd device attribute accessors
  KVM: arm64: Add a helper to check if a VM has ran once
  KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoMerge branch kvm-arm64/selftest/misc-6.4 into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:39:07 +0000 (09:39 +0100)]
Merge branch kvm-arm64/selftest/misc-6.4 into kvmarm-master/next

* kvm-arm64/selftest/misc-6.4:
  : .
  : Misc selftest updates for 6.4
  :
  : - Add comments for recently added ID registers
  : .
  KVM: selftests: Comment newly defined aarch64 ID registers

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoMerge branch kvm-arm64/selftest/lpa into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:37:36 +0000 (09:37 +0100)]
Merge branch kvm-arm64/selftest/lpa into kvmarm-master/next

* kvm-arm64/selftest/lpa:
  : .
  : Selftest fixes addressing PTE and TTBR0_EL1 encodings for
  : 52bit PAs
  : .
  KVM: selftests: arm64: Fix ttbr0_el1 encoding for PA bits > 48
  KVM: selftests: arm64: Fix pte encode/decode for PA bits > 48
  KVM: selftests: Fixup config fragment for access_tracking_perf_test

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoMerge branch kvm-arm64/timer-vm-offsets into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:31:17 +0000 (09:31 +0100)]
Merge branch kvm-arm64/timer-vm-offsets into kvmarm-master/next

* kvm-arm64/timer-vm-offsets: (21 commits)
  : .
  : This series aims at satisfying multiple goals:
  :
  : - allow a VMM to atomically restore a timer offset for a whole VM
  :   instead of updating the offset each time a vcpu get its counter
  :   written
  :
  : - allow a VMM to save/restore the physical timer context, something
  :   that we cannot do at the moment due to the lack of offsetting
  :
  : - provide a framework that is suitable for NV support, where we get
  :   both global and per timer, per vcpu offsetting, and manage
  :   interrupts in a less braindead way.
  :
  : Conflict resolution involves using the new per-vcpu config lock instead
  : of the home-grown timer lock.
  : .
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: selftests: Augment existing timer test to handle variable offset
  KVM: arm64: selftests: Deal with spurious timer interrupts
  KVM: arm64: selftests: Add physical timer registers to the sysreg list
  KVM: arm64: nv: timers: Support hyp timer emulation
  KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
  KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
  KVM: arm64: timers: Abstract the number of valid timers per vcpu
  KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
  KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
  KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
  KVM: arm64: timers: Abstract per-timer IRQ access
  KVM: arm64: timers: Rationalise per-vcpu timer init
  KVM: arm64: timers: Allow save/restoring of the physical timer
  KVM: arm64: timers: Allow userspace to set the global counter offset
  KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
  KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
  KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
  arm64: Add HAS_ECV_CNTPOFF capability
  arm64: Add CNTPOFF_EL2 register definition
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoMerge branch kvm-arm64/lock-inversion into kvmarm-master/next
Marc Zyngier [Fri, 21 Apr 2023 08:30:46 +0000 (09:30 +0100)]
Merge branch kvm-arm64/lock-inversion into kvmarm-master/next

* kvm-arm64/lock-inversion:
  : .
  : vm/vcpu lock inversion fixes, courtesy of Oliver Upton, plus a few
  : extra fixes from both Oliver and Reiji Watanabe.
  :
  : From the initial cover letter:
  :
  : As it so happens, lock ordering in KVM/arm64 is completely backwards.
  : There's a significant amount of VM-wide state that needs to be accessed
  : from the context of a vCPU. Until now, this was accomplished by
  : acquiring the kvm->lock, but that cannot be nested within vcpu->mutex.
  :
  : This series fixes the issue with some fine-grained locking for MP state
  : and a new, dedicated mutex that can nest with both kvm->lock and
  : vcpu->mutex.
  : .
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: arm64: Use config_lock to protect vgic state
  KVM: arm64: Use config_lock to protect data ordered against KVM_RUN
  KVM: arm64: Avoid lock inversion when setting the VM register width
  KVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON

Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoKVM: s390: pci: fix virtual-physical confusion on module unload/load
Nico Boehr [Wed, 22 Feb 2023 15:55:02 +0000 (16:55 +0100)]
KVM: s390: pci: fix virtual-physical confusion on module unload/load

When the kvm module is unloaded, zpci_setup_aipb() perists some data in the
zpci_aipb structure in s390 pci code. Note that this struct is also passed
to firmware in the zpci_set_irq_ctrl() call and thus the GAIT must be a
physical address.

On module re-insertion, the GAIT is restored from this structure in
zpci_reset_aipb(). But it is a physical address, hence this may cause
issues when the kvm module is unloaded and loaded again.

Fix virtual vs physical address confusion (which currently are the same) by
adding the necessary physical-to-virtual-conversion in zpci_reset_aipb().

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230222155503.43399-1-nrb@linux.ibm.com
Message-Id: <20230222155503.43399-1-nrb@linux.ibm.com>

15 months agoKVM: s390: vsie: clarifications on setting the APCB
Pierre Morel [Tue, 14 Feb 2023 12:28:41 +0000 (13:28 +0100)]
KVM: s390: vsie: clarifications on setting the APCB

The APCB is part of the CRYCB.
The calculation of the APCB origin can be done by adding
the APCB offset to the CRYCB origin.

Current code makes confusing transformations, converting
the CRYCB origin to a pointer to calculate the APCB origin.

Let's make things simpler and keep the CRYCB origin to make
these calculations.

Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com
Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com>

15 months agoKVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
Nico Boehr [Thu, 23 Feb 2023 16:22:36 +0000 (17:22 +0100)]
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA

We sometimes put a virtual address in next_alert, which should always be
a physical address, since it is shared with hardware.

This currently works, because virtual and physical addresses are
the same.

Add phys_to_virt() to resolve the virtual-physical confusion.

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230223162236.51569-1-nrb@linux.ibm.com
Message-Id: <20230223162236.51569-1-nrb@linux.ibm.com>

15 months agoKVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
Reiji Watanabe [Wed, 19 Apr 2023 02:18:52 +0000 (19:18 -0700)]
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state

All accessors of kvm_vcpu_arch::mp_state should be {READ,WRITE}_ONCE(),
since readers of the mp_state don't acquire the mp_state_lock.
Nonetheless, kvm_psci_vcpu_on() updates the mp_state without using
WRITE_ONCE(). So, fix the code to update the mp_state using WRITE_ONCE.

Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-3-reijiw@google.com
15 months agoKVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
Reiji Watanabe [Wed, 19 Apr 2023 02:18:51 +0000 (19:18 -0700)]
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()

kvm_arch_vcpu_ioctl_vcpu_init() doesn't acquire mp_state_lock
when setting the mp_state to KVM_MP_STATE_RUNNABLE. Fix the
code to acquire the lock.

Signed-off-by: Reiji Watanabe <reijiw@google.com>
[maz: minor refactor]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-2-reijiw@google.com
15 months agoKVM: arm64: vhe: Drop extra isb() on guest exit
Marc Zyngier [Sat, 8 Apr 2023 16:04:27 +0000 (17:04 +0100)]
KVM: arm64: vhe: Drop extra isb() on guest exit

__kvm_vcpu_run_vhe() end on VHE with an isb(). However, this
function is only reachable via kvm_call_hyp_ret(), which already
contains an isb() in order to mimick the behaviour of nVHE and
provide a context synchronisation event.

We thus have two isb()s back to back, which is one too many.
Drop the first one and solely rely on the one in the helper.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
15 months agoKVM: arm64: vhe: Synchronise with page table walker on MMU update
Marc Zyngier [Sat, 8 Apr 2023 16:04:26 +0000 (17:04 +0100)]
KVM: arm64: vhe: Synchronise with page table walker on MMU update

Contrary to nVHE, VHE is a lot easier when it comes to dealing
with speculative page table walks started at EL1. As we only change
EL1&0 translation regime when context-switching, we already benefit
from the effect of the DSB that sits in the context switch code.

We only need to take care of it in the NV case, where we can
flip between between two EL1 contexts (one of them being the virtual
EL2) without a context switch.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
15 months agoKVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
Marc Zyngier [Sat, 8 Apr 2023 16:04:25 +0000 (17:04 +0100)]
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()

We rely on the presence of a DSB at the end of kvm_flush_dcache_to_poc()
that, on top of ensuring completion of the cache clean, also covers
the speculative page table walk started from EL1.

Document this dependency.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
15 months agoKVM: arm64: nvhe: Synchronise with page table walker on TLBI
Marc Zyngier [Sat, 8 Apr 2023 16:04:24 +0000 (17:04 +0100)]
KVM: arm64: nvhe: Synchronise with page table walker on TLBI

A TLBI from EL2 impacting EL1 involves messing with the EL1&0
translation regime, and the page table walker may still be
performing speculative walks.

Piggyback on the existing DSBs to always have a DSB ISH that
will synchronise all load/store operations that the PTW may
still have.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoKVM: arm64: Handle 32bit CNTPCTSS traps
Marc Zyngier [Thu, 13 Apr 2023 13:23:42 +0000 (14:23 +0100)]
KVM: arm64: Handle 32bit CNTPCTSS traps

When CNTPOFF isn't implemented and that we have a non-zero counter
offset, CNTPCT and CNTPCTSS are trapped. We properly handle the
former, but not the latter, as it is not present in the sysreg
table (despite being actually handled in the code). Bummer.

Just populate the cp15_64 table with the missing register.

Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
15 months agoKVM: arm64: nvhe: Synchronise with page table walker on vcpu run
Marc Zyngier [Sat, 8 Apr 2023 16:04:23 +0000 (17:04 +0100)]
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run

When taking an exception between the EL1&0 translation regime and
the EL2 translation regime, the page table walker is allowed to
complete the walks started from EL0 or EL1 while running at EL2.

It means that altering the system registers that define the EL1&0
translation regime is fraught with danger *unless* we wait for
the completion of such walk with a DSB (R_LFHQG and subsequent
statements in the ARM ARM). We already did the right thing for
other external agents (SPE, TRBE), but not the PTW.

Rework the existing SPE/TRBE synchronisation to include the PTW,
and add the missing DSB on guest exit.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
15 months agoKVM: arm64: vgic: Don't acquire its_lock before config_lock
Oliver Upton [Wed, 12 Apr 2023 06:27:33 +0000 (06:27 +0000)]
KVM: arm64: vgic: Don't acquire its_lock before config_lock

commit f00327731131 ("KVM: arm64: Use config_lock to protect vgic
state") was meant to rectify a longstanding lock ordering issue in KVM
where the kvm->lock is taken while holding vcpu->mutex. As it so
happens, the aforementioned commit introduced yet another locking issue
by acquiring the its_lock before acquiring the config lock.

This is obviously wrong, especially considering that the lock ordering
is well documented in vgic.c. Reshuffle the locks once more to take the
config_lock before the its_lock. While at it, sprinkle in the lockdep
hinting that has become popular as of late to keep lockdep apprised of
our ordering.

Cc: stable@vger.kernel.org
Fixes: f00327731131 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230412062733.988229-1-oliver.upton@linux.dev
15 months agoKVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults
Sean Christopherson [Wed, 5 Apr 2023 00:26:08 +0000 (17:26 -0700)]
KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults

Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for
permission faults when emulating a guest memory access and CR0.WP may be
guest owned.  If the guest toggles only CR0.WP and triggers emulation of
a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a
stale CR0.WP, i.e. use stale protection bits metadata.

Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP
is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't
support per-bit interception controls for CR0.  Don't bother checking for
EPT vs. NPT as the "old == new" check will always be true under NPT, i.e.
the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0
from the VMCB on VM-Exit).

Reported-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net
Fixes: fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit")
Tested-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code
Sean Christopherson [Wed, 5 Apr 2023 00:31:33 +0000 (17:31 -0700)]
KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code

Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages
pair instead of a struct, and bury said struct in Hyper-V specific code.

Passing along two params generates much better code for the common case
where KVM is _not_ running on Hyper-V, as forwarding the flush on to
Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range()
becomes a tail call.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86: Rename Hyper-V remote TLB hooks to match established scheme
Sean Christopherson [Wed, 5 Apr 2023 00:31:32 +0000 (17:31 -0700)]
KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme

Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
arch hooks from common code, etc.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
Colin Ian King [Thu, 6 Apr 2023 08:02:26 +0000 (09:02 +0100)]
KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"

There is a spelling mistake in a test assert message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230406080226.122955-1-colin.i.king@gmail.com
15 months agoKVM: arm64: Test that SMC64 arch calls are reserved
Oliver Upton [Sat, 8 Apr 2023 12:17:32 +0000 (12:17 +0000)]
KVM: arm64: Test that SMC64 arch calls are reserved

Assert that the SMC64 view of the Arm architecture range is reserved by
KVM and cannot be filtered by userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-3-oliver.upton@linux.dev
15 months agoKVM: arm64: Prevent userspace from handling SMC64 arch range
Oliver Upton [Sat, 8 Apr 2023 12:17:31 +0000 (12:17 +0000)]
KVM: arm64: Prevent userspace from handling SMC64 arch range

Though presently unused, there is an SMC64 view of the Arm architecture
calls defined by the SMCCC. The documentation of the SMCCC filter states
that the SMC64 range is reserved, but nothing actually prevents
userspace from applying a filter to the range.

Insert a range with the HANDLE action for the SMC64 arch range, thereby
preventing userspace from imposing filtering/forwarding on it.

Fixes: fb88707dd39b ("KVM: arm64: Use a maple tree to represent the SMCCC filter")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-2-oliver.upton@linux.dev
15 months agoKVM: SVM: Return the local "r" variable from svm_set_msr()
Sean Christopherson [Wed, 22 Mar 2023 01:14:40 +0000 (18:14 -0700)]
KVM: SVM: Return the local "r" variable from svm_set_msr()

Rename "r" to "ret" and actually return it from svm_set_msr() to reduce
the probability of repeating the mistake of commit 723d5fb0ffe4 ("kvm:
svm: Add IA32_FLUSH_CMD guest support"), which set "r" thinking that it
would be propagated to the caller.

Alternatively, the declaration of "r" could be moved into the handling of
MSR_TSC_AUX, but that risks variable shadowing in the future.  A wrapper
for kvm_set_user_return_msr() would allow eliding a local variable, but
that feels like delaying the inevitable.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD
Sean Christopherson [Wed, 22 Mar 2023 01:14:39 +0000 (18:14 -0700)]
KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD

Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
if one of the many mitigations might require a flush in the guest, e.g.
Linux provides an option to flush the L1D when switching mms.

Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
and exposed to the guest, i.e. always let the guest write it directly if
FLUSH_L1D is fully supported.

Forward writes to hardware in host context on the off chance that KVM
ends up emulating a WRMSR, or in the really unlikely scenario where
userspace wants to force a flush.  Restrict these forwarded WRMSRs to
the known command out of an abundance of caution.  Passing through the
MSR means the guest can throw any and all values at hardware, but doing
so in host context is arguably a bit more dangerous.

Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code
Sean Christopherson [Wed, 22 Mar 2023 01:14:38 +0000 (18:14 -0700)]
KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code

Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the
logic to kvm_set_msr_common().  Now that the MSR interception toggling is
handled as part of setting guest CPUID, the VMX and SVM paths are
identical.

Opportunistically massage the code to make it a wee bit denser.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: SVM: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Sean Christopherson [Wed, 22 Mar 2023 01:14:37 +0000 (18:14 -0700)]
KVM: SVM: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID

Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID
Sean Christopherson [Wed, 22 Mar 2023 01:14:36 +0000 (18:14 -0700)]
KVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID

Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is
supported and enabled, i.e. don't wait until the first write.  There's no
benefit to deferred passthrough, and the extra logic only adds complexity.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling
Sean Christopherson [Wed, 22 Mar 2023 01:14:35 +0000 (18:14 -0700)]
KVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling

Revert the recently added virtualizing of MSR_IA32_FLUSH_CMD, as both
the VMX and SVM are fatally buggy to guests that use MSR_IA32_FLUSH_CMD or
MSR_IA32_PRED_CMD, and because the entire foundation of the logic is
flawed.

The most immediate problem is an inverted check on @cmd that results in
rejecting legal values.  SVM doubles down on bugs and drops the error,
i.e. silently breaks all guest mitigations based on the command MSRs.

The next issue is that neither VMX nor SVM was updated to mark
MSR_IA32_FLUSH_CMD as being a possible passthrough MSR,
which isn't hugely problematic, but does break MSR filtering and triggers
a WARN on VMX designed to catch this exact bug.

The foundational issues stem from the MSR_IA32_FLUSH_CMD code reusing
logic from MSR_IA32_PRED_CMD, which in turn was likely copied from KVM's
support for MSR_IA32_SPEC_CTRL.  The copy+paste from MSR_IA32_SPEC_CTRL
was misguided as MSR_IA32_PRED_CMD (and MSR_IA32_FLUSH_CMD) is a
write-only MSR, i.e. doesn't need the same "deferred passthrough"
shenanigans as MSR_IA32_SPEC_CTRL.

Revert all MSR_IA32_FLUSH_CMD enabling in one fell swoop so that there is
no point where KVM advertises, but does not support, L1D_FLUSH.

This reverts commits 45cf86f26148e549c5ba4a8ab32a390e4bde216e,
723d5fb0ffe4c02bd4edf47ea02c02e454719f28, and
a807b78ad04b2eaa348f52f5cc7702385b6de1ee.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/20230317190432.GA863767%40dev-arch.thelio-3990X
Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Message-Id: <20230322011440.2195485-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
15 months agoKVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static
Tom Rix [Tue, 4 Apr 2023 01:01:41 +0000 (21:01 -0400)]
KVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static

smatch reports
arch/x86/kvm/x86.c:199:20: warning: symbol
  'mitigate_smt_rsb' was not declared. Should it be static?

This variable is only used in one file so it should be static.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: arm64: Expose SMC/HVC width to userspace
Marc Zyngier [Wed, 5 Apr 2023 11:48:58 +0000 (12:48 +0100)]
KVM: arm64: Expose SMC/HVC width to userspace

When returning to userspace to handle a SMCCC call, we consistently
set PC to point to the instruction immediately after the HVC/SMC.

However, should userspace need to know the exact address of the
trapping instruction, it needs to know about the *size* of that
instruction. For AArch64, this is pretty easy. For AArch32, this
is a bit more funky, as Thumb has 16bit encodings for both HVC
and SMC.

Expose this to userspace with a new flag that directly derives
from ESR_EL2.IL. Also update the documentation to reflect the PC
state at the point of exit.

Finally, this fixes a small buglet where the hypercall.{args,ret}
fields would not be cleared on exit, and could contain some
random junk.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86pm8iv8tj.wl-maz@kernel.org
15 months agoKVM: selftests: Add test for SMCCC filter
Oliver Upton [Tue, 4 Apr 2023 15:40:50 +0000 (15:40 +0000)]
KVM: selftests: Add test for SMCCC filter

Add a selftest for the SMCCC filter, ensuring basic UAPI constraints
(e.g. reserved ranges, non-overlapping ranges) are upheld. Additionally,
test that the DENIED and FWD_TO_USER work as intended.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-14-oliver.upton@linux.dev
15 months agoKVM: selftests: Add a helper for SMCCC calls with SMC instruction
Oliver Upton [Tue, 4 Apr 2023 15:40:49 +0000 (15:40 +0000)]
KVM: selftests: Add a helper for SMCCC calls with SMC instruction

Build a helper for doing SMCs in selftests by macro-izing the current
HVC implementation and taking the conduit instruction as an argument.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-13-oliver.upton@linux.dev
15 months agoKVM: arm64: Let errors from SMCCC emulation to reach userspace
Oliver Upton [Tue, 4 Apr 2023 15:40:48 +0000 (15:40 +0000)]
KVM: arm64: Let errors from SMCCC emulation to reach userspace

Typically a negative return from an exit handler is used to request a
return to userspace with the specified error. KVM's handling of SMCCC
emulation (i.e. both HVCs and SMCs) deviates from the trend and resumes
the guest instead.

Stop handling negative returns this way and instead let the error
percolate to userspace.

Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-12-oliver.upton@linux.dev
15 months agoKVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
Oliver Upton [Tue, 4 Apr 2023 15:40:47 +0000 (15:40 +0000)]
KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version

A subsequent change to KVM will allow negative returns from SMCCC
handlers to exit to userspace. Make way for this change by explicitly
returning SMCCC_RET_NOT_SUPPORTED to the guest if the VM is configured
to use an unknown PSCI version. Add a WARN since this is undoubtedly a
KVM bug.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-11-oliver.upton@linux.dev
15 months agoKVM: arm64: Introduce support for userspace SMCCC filtering
Oliver Upton [Tue, 4 Apr 2023 15:40:46 +0000 (15:40 +0000)]
KVM: arm64: Introduce support for userspace SMCCC filtering

As the SMCCC (and related specifications) march towards an 'everything
and the kitchen sink' interface for interacting with a system it becomes
less likely that KVM will support every related feature. We could do
better by letting userspace have a crack at it instead.

Allow userspace to define an 'SMCCC filter' that applies to both HVCs
and SMCs initiated by the guest. Supporting both conduits with this
interface is important for a couple of reasons. Guest SMC usage is table
stakes for a nested guest, as HVCs are always taken to the virtual EL2.
Additionally, guests may want to interact with a service on the secure
side which can now be proxied by userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-10-oliver.upton@linux.dev
15 months agoKVM: arm64: Add support for KVM_EXIT_HYPERCALL
Oliver Upton [Tue, 4 Apr 2023 15:40:45 +0000 (15:40 +0000)]
KVM: arm64: Add support for KVM_EXIT_HYPERCALL

In anticipation of user hypercall filters, add the necessary plumbing to
get SMCCC calls out to userspace. Even though the exit structure has
space for KVM to pass register arguments, let's just avoid it altogether
and let userspace poke at the registers via KVM_GET_ONE_REG.

This deliberately stretches the definition of a 'hypercall' to cover
SMCs from EL1 in addition to the HVCs we know and love. KVM doesn't
support EL1 calls into secure services, but now we can paint that as a
userspace problem and be done with it.

Finally, we need a flag to let userspace know what conduit instruction
was used (i.e. SMC vs. HVC).

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-9-oliver.upton@linux.dev
15 months agoKVM: arm64: Use a maple tree to represent the SMCCC filter
Oliver Upton [Tue, 4 Apr 2023 15:40:44 +0000 (15:40 +0000)]
KVM: arm64: Use a maple tree to represent the SMCCC filter

Maple tree is an efficient B-tree implementation that is intended for
storing non-overlapping intervals. Such a data structure is a good fit
for the SMCCC filter as it is desirable to sparsely allocate the 32 bit
function ID space.

To that end, add a maple tree to kvm_arch and correctly init/teardown
along with the VM. Wire in a test against the hypercall filter for HVCs
which does nothing until the controls are exposed to userspace.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-8-oliver.upton@linux.dev
15 months agoKVM: arm64: Refactor hvc filtering to support different actions
Oliver Upton [Tue, 4 Apr 2023 15:40:43 +0000 (15:40 +0000)]
KVM: arm64: Refactor hvc filtering to support different actions

KVM presently allows userspace to filter guest hypercalls with bitmaps
expressed via pseudo-firmware registers. These bitmaps have a narrow
scope and, of course, can only allow/deny a particular call. A
subsequent change to KVM will introduce a generalized UAPI for filtering
hypercalls, allowing functions to be forwarded to userspace.

Refactor the existing hypercall filtering logic to make room for more
than two actions. While at it, generalize the function names around
SMCCC as it is the basis for the upcoming UAPI.

No functional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-7-oliver.upton@linux.dev
15 months agoKVM: arm64: Start handling SMCs from EL1
Oliver Upton [Tue, 4 Apr 2023 15:40:42 +0000 (15:40 +0000)]
KVM: arm64: Start handling SMCs from EL1

Whelp, the architecture gods have spoken and confirmed that the function
ID space is common between SMCs and HVCs. Not only that, the expectation
is that hypervisors handle calls to both SMC and HVC conduits. KVM
recently picked up support for SMCCCs in commit bd36b1a9eb5a ("KVM:
arm64: nv: Handle SMCs taken from virtual EL2") but scoped it only to a
nested hypervisor.

Let's just open the floodgates and let EL1 access our SMCCC
implementation with the SMC instruction as well.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-6-oliver.upton@linux.dev
15 months agoKVM: arm64: Rename SMC/HVC call handler to reflect reality
Oliver Upton [Tue, 4 Apr 2023 15:40:41 +0000 (15:40 +0000)]
KVM: arm64: Rename SMC/HVC call handler to reflect reality

KVM handles SMCCC calls from virtual EL2 that use the SMC instruction
since commit bd36b1a9eb5a ("KVM: arm64: nv: Handle SMCs taken from
virtual EL2"). Thus, the function name of the handler no longer reflects
reality.

Normalize the name on SMCCC, since that's the only hypercall interface
KVM supports in the first place. No fuctional change intended.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-5-oliver.upton@linux.dev
15 months agoKVM: arm64: Add vm fd device attribute accessors
Oliver Upton [Tue, 4 Apr 2023 15:40:40 +0000 (15:40 +0000)]
KVM: arm64: Add vm fd device attribute accessors

A subsequent change will allow userspace to convey a filter for
hypercalls through a vm device attribute. Add the requisite boilerplate
for vm attribute accessors.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-4-oliver.upton@linux.dev
15 months agoKVM: arm64: Add a helper to check if a VM has ran once
Oliver Upton [Tue, 4 Apr 2023 15:40:39 +0000 (15:40 +0000)]
KVM: arm64: Add a helper to check if a VM has ran once

The test_bit(...) pattern is quite a lot of keystrokes. Replace
existing callsites with a helper.

No functional change intended.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-3-oliver.upton@linux.dev
15 months agoKVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
Oliver Upton [Tue, 4 Apr 2023 15:40:38 +0000 (15:40 +0000)]
KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL

The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.

Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
15 months agoKVM: x86/mmu: Merge all handle_changed_pte*() functions
Vipin Sharma [Tue, 21 Mar 2023 22:00:21 +0000 (15:00 -0700)]
KVM: x86/mmu: Merge all handle_changed_pte*() functions

Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a
single function, handle_changed_pte(), as the two are always used
together.  Remove the existing handle_changed_pte(), as it's just a
wrapper that calls __handle_changed_pte() and
handle_changed_spte_acc_track().

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Remove handle_changed_spte_dirty_log()
Vipin Sharma [Tue, 21 Mar 2023 22:00:20 +0000 (15:00 -0700)]
KVM: x86/mmu: Remove handle_changed_spte_dirty_log()

Remove handle_changed_spte_dirty_log() as there is no code flow which
sets 4KiB SPTE writable and hit this path. This function marks the page
dirty in a memslot only if new SPTE is 4KiB in size and writable.

Current users of handle_changed_spte_dirty_log() are:
1. set_spte_gfn() - Create only non writable SPTEs.
2. write_protect_gfn() - Change an SPTE to non writable.
3. zap leaf and roots APIs - Everything is 0.
4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE
5. tdp_mmu_link_sp() - Makes non leaf SPTEs.

There is also no path which creates a writable 4KiB without going
through make_spte() and this functions takes care of marking SPTE dirty
in the memslot if it is PT_WRITABLE.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: add blurb to __handle_changed_spte()'s comment]
Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()
Vipin Sharma [Tue, 21 Mar 2023 22:00:19 +0000 (15:00 -0700)]
KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()

Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and
refactor the code. This variable is always set to true by its caller.

Remove single and double underscore prefix from tdp_mmu_set_spte()
related APIs:
1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte()
2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte()

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEs
Vipin Sharma [Tue, 21 Mar 2023 22:00:18 +0000 (15:00 -0700)]
KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEs

Drop everything except the "tdp_mmu_spte_changed" tracepoint part of
__handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the
accessed status doesn't affect the SPTE's shadow-present status, whether
or not the SPTE is a leaf, or change the PFN.  I.e. none of the functional
updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured MMU or a locking bug elsewhere.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEs
Vipin Sharma [Tue, 21 Mar 2023 22:00:17 +0000 (15:00 -0700)]
KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEs

Drop the unnecessary call to handle dirty log updates when aging TDP MMU
SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access
tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn
dirty in its memslot.  The access tracking path can _clear_ the Writable
bit, e.g. if the XCHG races with fast_page_fault() and writes the stale
value without the Writable bit set, but clearing the Writable bit outside
of mmu_lock is not allowed, i.e. access tracking can't spuriously set the
Writable bit.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEs
Vipin Sharma [Tue, 21 Mar 2023 22:00:16 +0000 (15:00 -0700)]
KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEs

Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU
SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit.
Similar to the D-bit story, this will allow KVM to bypass
__handle_changed_spte() by ensuring only the A-bit is modified.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()
Vipin Sharma [Tue, 21 Mar 2023 22:00:15 +0000 (15:00 -0700)]
KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()

Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and
refactor the code as this variable is always set to true by its caller.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bits
Vipin Sharma [Tue, 21 Mar 2023 22:00:14 +0000 (15:00 -0700)]
KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bits

Drop everything except marking the PFN dirty and the relevant tracepoint
parts of __handle_changed_spte() when clearing the dirty status of gfns in
the TDP MMU.  Clearing only the Dirty (or Writable) bit doesn't affect
the SPTEs shadow-present status, whether or not the SPTE is a leaf, or
change the SPTE's PFN.  I.e. other than marking the PFN dirty, none of the
functional updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured or a locking bug elsewhere.

Opportunistically remove a comment blurb from __handle_changed_spte()
about all modifications to TDP MMU SPTEs needing to invoke said function,
that "rule" hasn't been true since fast page fault support was added for
the TDP MMU (and perhaps even before).

Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of
clear dirty log stage improved by ~40% in dirty_log_perf_test (with the
full optimization applied).

Before optimization:
--------------------
Iteration 1 clear dirty log time: 3.638543593s
Iteration 2 clear dirty log time: 3.145032742s
Iteration 3 clear dirty log time: 3.142340358s
Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration)

After optimization:
-------------------
Iteration 1 clear dirty log time: 2.318988110s
Iteration 2 clear dirty log time: 1.794470164s
Iteration 3 clear dirty log time: 1.791668628s
Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration)

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bits
Vipin Sharma [Tue, 21 Mar 2023 22:00:13 +0000 (15:00 -0700)]
KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bits

Drop the unnecessary call to handle access-tracking changes when clearing
the dirty status of TDP MMU SPTEs.  Neither the Dirty bit nor the Writable
bit has any impact on the accessed state of a page, i.e. clearing only
the aforementioned bits doesn't make an accessed SPTE suddently not
accessed.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flow
Vipin Sharma [Tue, 21 Mar 2023 22:00:12 +0000 (15:00 -0700)]
KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flow

Optimize the clearing of dirty state in TDP MMU SPTEs by doing an
atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG
that currently ends up being invoked (see kvm_tdp_mmu_write_spte()).
Clearing _only_ the bit in question will allow KVM to skip the many
irrelevant checks in __handle_changed_spte() by avoiding any collateral
damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race
with fast_page_fault() setting the W-bit and the CPU setting the D-bit,
and thus incorrectly drop the CPU's D-bit update.

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMU
Vipin Sharma [Tue, 21 Mar 2023 22:00:11 +0000 (15:00 -0700)]
KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMU

Deduplicate the guts of the TDP MMU's clearing of dirty status by
snapshotting whether to check+clear the Dirty bit vs. the Writable bit,
which is the only difference between the two flavors of dirty tracking.

Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e.
is constant after kvm-{intel,amd}.ko is loaded.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty log, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot
Vipin Sharma [Tue, 21 Mar 2023 22:00:10 +0000 (15:00 -0700)]
KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot

Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in
the TDP MMU need to be write-protected when clearing accessed/dirty status
instead of manually checking every SPTE.  The per-SPTE A/D enabling is
specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is
not, and so cannot happen in the TDP MMU (which is non-nested only).

Keep the original code as sanity checks buried under MMU_WARN_ON().
MMU_WARN_ON() is more or less useless at the moment, but there are plans
to change that.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Add a helper function to check if an SPTE needs atomic write
Vipin Sharma [Tue, 21 Mar 2023 22:00:09 +0000 (15:00 -0700)]
KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic write

Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should
be written atomically or not to a separate function.

This new function, kvm_tdp_mmu_spte_need_atomic_write(),  will be used
in future commits to optimize clearing bits in SPTEs.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Comment newly defined aarch64 ID registers
Mark Brown [Mon, 6 Mar 2023 16:15:07 +0000 (16:15 +0000)]
KVM: selftests: Comment newly defined aarch64 ID registers

All otherwise unspecified aarch64 ID registers should be read as zero so
we cover the whole ID register space in the get-reg-list test but we've
added comments for those that have been named. Add comments for
ID_AA64PFR2_EL1, ID_AA64SMFR0_EL1, ID_AA64ISAR2_EL1, ID_AA64MMFR3_EL1
and ID_AA64MMFR4_EL1 which have been defined since the comments were
added so someone looking for them will see that they are covered.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230210-kvm-arm64-getreg-comments-v1-1-a16c73be5ab4@kernel.org
15 months agoKVM: selftests: arm64: Fix ttbr0_el1 encoding for PA bits > 48
Ryan Roberts [Wed, 8 Mar 2023 11:09:48 +0000 (11:09 +0000)]
KVM: selftests: arm64: Fix ttbr0_el1 encoding for PA bits > 48

Bits [51:48] of the pgd address are stored at bits [5:2] of ttbr0_el1.
page_table_test stores its page tables at the far end of IPA space so
was tripping over this when run on a system that supports FEAT_LPA (or
FEAT_LPA2).

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230308110948.1820163-4-ryan.roberts@arm.com
15 months agoKVM: selftests: arm64: Fix pte encode/decode for PA bits > 48
Ryan Roberts [Wed, 8 Mar 2023 11:09:47 +0000 (11:09 +0000)]
KVM: selftests: arm64: Fix pte encode/decode for PA bits > 48

The high bits [51:48] of a physical address should appear at [15:12] in
a 64K pte, not at [51:48] as was previously being programmed. Fix this
with new helper functions that do the conversion correctly. This also
sets us up nicely for adding LPA2 encodings in future.

Fixes: 7a6629ef746d ("kvm: selftests: add virt mem support for aarch64")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230308110948.1820163-3-ryan.roberts@arm.com
15 months agoKVM: selftests: Fixup config fragment for access_tracking_perf_test
Ryan Roberts [Wed, 8 Mar 2023 11:09:46 +0000 (11:09 +0000)]
KVM: selftests: Fixup config fragment for access_tracking_perf_test

access_tracking_perf_test requires CONFIG_IDLE_PAGE_TRACKING. However
this is missing from the config fragment, so add it in so that this test
is no longer skipped.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230308110948.1820163-2-ryan.roberts@arm.com
15 months agoKVM: arm64: selftests: Augment existing timer test to handle variable offset
Marc Zyngier [Thu, 30 Mar 2023 17:48:00 +0000 (18:48 +0100)]
KVM: arm64: selftests: Augment existing timer test to handle variable offset

Allow a user to specify the global offset on the command-line.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-21-maz@kernel.org
15 months agoKVM: arm64: selftests: Deal with spurious timer interrupts
Marc Zyngier [Thu, 30 Mar 2023 17:47:59 +0000 (18:47 +0100)]
KVM: arm64: selftests: Deal with spurious timer interrupts

Make sure the timer test can properly handle a spurious timer
interrupt, something that is far from being unlikely.

This involves checking for the GIC IAR return value (don't bother
handling the interrupt if it was spurious) as well as the timer
control register (don't do anything if the interrupt is masked
or the timer disabled). Take this opportunity to rewrite the
timer handler in a more readable way.

This solves a bunch of failures that creep up on systems that
are slow to retire the interrupt, something that the GIC architecture
makes no guarantee about.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-20-maz@kernel.org
15 months agoKVM: arm64: selftests: Add physical timer registers to the sysreg list
Marc Zyngier [Thu, 30 Mar 2023 17:47:58 +0000 (18:47 +0100)]
KVM: arm64: selftests: Add physical timer registers to the sysreg list

Now that KVM exposes CNTPCT_EL0, CNTP_CTL_EL0 and CNT_CVAL_EL0 to
userspace, add them to the get-reg-list selftest.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-19-maz@kernel.org
15 months agoKVM: arm64: nv: timers: Support hyp timer emulation
Marc Zyngier [Thu, 30 Mar 2023 17:47:57 +0000 (18:47 +0100)]
KVM: arm64: nv: timers: Support hyp timer emulation

Emulating EL2 also means emulating the EL2 timers. To do so, we expand
our timer framework to deal with at most 4 timers. At any given time,
two timers are using the HW timers, and the two others are purely
emulated.

The role of deciding which is which at any given time is left to a
mapping function which is called every time we need to make such a
decision.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-18-maz@kernel.org
15 months agoKVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
Marc Zyngier [Thu, 30 Mar 2023 17:47:56 +0000 (18:47 +0100)]
KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset

Being able to set a global offset isn't enough.

With NV, we also need to a per-vcpu, per-timer offset (for example,
CNTVCT_EL0 being offset by CNTVOFF_EL2).

Use a similar method as the VM-wide offset to have a timer point
to the shadow register that contains the offset value.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-17-maz@kernel.org
15 months agoKVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
Marc Zyngier [Thu, 30 Mar 2023 17:47:55 +0000 (18:47 +0100)]
KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co

Add some basic documentation on the effects of KVM_ARM_SET_CNT_OFFSETS.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-16-maz@kernel.org
15 months agoKVM: arm64: timers: Abstract the number of valid timers per vcpu
Marc Zyngier [Thu, 30 Mar 2023 17:47:54 +0000 (18:47 +0100)]
KVM: arm64: timers: Abstract the number of valid timers per vcpu

We so far have a pretty fixed number of timers to take care of.
This is about to change as NV brings another two into the
picture, and we must be careful not to try and emulate non-valid
timers in a given VM.

For this, abstract the number of timers for a given vcpu behind
an accessor, which helpfully returns a constant for now.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-15-maz@kernel.org
15 months agoKVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
Marc Zyngier [Thu, 30 Mar 2023 17:47:53 +0000 (18:47 +0100)]
KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling

Now that it is likely that CNTPCT_EL0 accesses will trap,
fast-track the emulation of the counter read which doesn't
need more that a simple offsetting.

One day, we'll have CNTPOFF everywhere. One day.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-14-maz@kernel.org
15 months agoKVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
Marc Zyngier [Thu, 30 Mar 2023 17:47:52 +0000 (18:47 +0100)]
KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor

For VHE-specific hypervisor code, kern_hyp_va() is a NOP.

Actually, it is a whole range of NOPs. It'd be much better if
this code simply didn't exist. Let's just do that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-13-maz@kernel.org
15 months agoKVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
Marc Zyngier [Thu, 30 Mar 2023 17:47:51 +0000 (18:47 +0100)]
KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data

Having the timer IRQs duplicated into each vcpu isn't great, and
becomes absolutely awful with NV. So let's move these into
the per-VM arch_timer_vm_data structure.

This simplifies a lot of code, but requires us to introduce a
mutex so that we can reason about userspace trying to change
an interrupt number while another vcpu is running, something
that wasn't really well handled so far.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-12-maz@kernel.org
15 months agoKVM: arm64: timers: Abstract per-timer IRQ access
Marc Zyngier [Thu, 30 Mar 2023 17:47:50 +0000 (18:47 +0100)]
KVM: arm64: timers: Abstract per-timer IRQ access

As we are about to move the location of the per-timer IRQ into
the VM structure, abstract the location of the IRQ behind an
accessor. This will make the repainting sligntly less painful.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-11-maz@kernel.org
15 months agoKVM: arm64: timers: Rationalise per-vcpu timer init
Marc Zyngier [Thu, 30 Mar 2023 17:47:49 +0000 (18:47 +0100)]
KVM: arm64: timers: Rationalise per-vcpu timer init

The way we initialise our timer contexts may be satisfactory
for two timers, but will be getting pretty annoying with four.

Cleanup the whole thing by removing the code duplication and
getting rid of unused IRQ configuration elements.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-10-maz@kernel.org
15 months agoKVM: arm64: timers: Allow save/restoring of the physical timer
Marc Zyngier [Thu, 30 Mar 2023 17:47:48 +0000 (18:47 +0100)]
KVM: arm64: timers: Allow save/restoring of the physical timer

Nothing like being 10 year late to a party!

Now that userspace can set counter offsets, we can save/restore
the physical timer as well!

Nobody really cared so far, but you're welcome anyway.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-9-maz@kernel.org
15 months agoKVM: arm64: timers: Allow userspace to set the global counter offset
Marc Zyngier [Thu, 30 Mar 2023 17:47:47 +0000 (18:47 +0100)]
KVM: arm64: timers: Allow userspace to set the global counter offset

And this is the moment you have all been waiting for: setting the
counter offset from userspace.

We expose a brand new capability that reports the ability to set
the offset for both the virtual and physical sides.

In keeping with the architecture, the offset is expressed as
a delta that is substracted from the physical counter value.

Once this new API is used, there is no going back, and the counters
cannot be written to to set the offsets implicitly (the writes
are instead ignored).

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-8-maz@kernel.org
15 months agoKVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
Marc Zyngier [Thu, 30 Mar 2023 17:47:46 +0000 (18:47 +0100)]
KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM

Being able to lock/unlock all vcpus in one go is a feature that
only the vgic has enjoyed so far. Let's be brave and expose it
to the world.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-7-maz@kernel.org
15 months agoKVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
Marc Zyngier [Thu, 30 Mar 2023 17:47:45 +0000 (18:47 +0100)]
KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2

CNTPOFF_EL2 is awesome, but it is mostly vapourware, and no publicly
available implementation has it. So for the common mortals, let's
implement the emulated version of this thing.

It means trapping accesses to the physical counter and timer, and
emulate some of it as necessary.

As for CNTPOFF_EL2, nobody sets the offset yet.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-6-maz@kernel.org
15 months agoKVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
Marc Zyngier [Thu, 30 Mar 2023 17:47:44 +0000 (18:47 +0100)]
KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer

With ECV and CNTPOFF_EL2, it is very easy to offer an offset for
the physical timer. So let's do just that.

Nothing can set the offset yet, so this should have no effect
whatsoever (famous last words...).

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-5-maz@kernel.org
15 months agoarm64: Add HAS_ECV_CNTPOFF capability
Marc Zyngier [Thu, 30 Mar 2023 17:47:43 +0000 (18:47 +0100)]
arm64: Add HAS_ECV_CNTPOFF capability

Add the probing code for the FEAT_ECV variant that implements CNTPOFF_EL2.
Why it is optional is a mystery, but let's try and detect it.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-4-maz@kernel.org
15 months agoarm64: Add CNTPOFF_EL2 register definition
Marc Zyngier [Thu, 30 Mar 2023 17:47:42 +0000 (18:47 +0100)]
arm64: Add CNTPOFF_EL2 register definition

Add the definition for CNTPOFF_EL2 in the description file.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-3-maz@kernel.org
15 months agoKVM: arm64: timers: Use a per-vcpu, per-timer accumulator for fractional ns
Marc Zyngier [Thu, 30 Mar 2023 17:47:41 +0000 (18:47 +0100)]
KVM: arm64: timers: Use a per-vcpu, per-timer accumulator for fractional ns

Instead of accumulating the fractional ns value generated every time
we compute a ns delta in a global variable, use a per-vcpu, per-timer
variable. This keeps the fractional ns local to the timer instead of
contributing to any odd, unrelated timer.

Reviewed-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230330174800.2677007-2-maz@kernel.org
15 months agoKVM: arm64: Use config_lock to protect vgic state
Oliver Upton [Mon, 27 Mar 2023 16:47:47 +0000 (16:47 +0000)]
KVM: arm64: Use config_lock to protect vgic state

Almost all of the vgic state is VM-scoped but accessed from the context
of a vCPU. These accesses were serialized on the kvm->lock which cannot
be nested within a vcpu->mutex critical section.

Move over the vgic state to using the config_lock. Tweak the lock
ordering where necessary to ensure that the config_lock is acquired
after the vcpu->mutex. Acquire the config_lock in kvm_vgic_create() to
avoid a race between the converted flows and GIC creation. Where
necessary, continue to acquire kvm->lock to avoid a race with vCPU
creation (i.e. flows that use lock_all_vcpus()).

Finally, promote the locking expectations in comments to lockdep
assertions and update the locking documentation for the config_lock as
well as vcpu->mutex.

Cc: stable@vger.kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-5-oliver.upton@linux.dev
15 months agoKVM: arm64: Use config_lock to protect data ordered against KVM_RUN
Oliver Upton [Mon, 27 Mar 2023 16:47:46 +0000 (16:47 +0000)]
KVM: arm64: Use config_lock to protect data ordered against KVM_RUN

There are various bits of VM-scoped data that can only be configured
before the first call to KVM_RUN, such as the hypercall bitmaps and
the PMU. As these fields are protected by the kvm->lock and accessed
while holding vcpu->mutex, this is yet another example of lock
inversion.

Change out the kvm->lock for kvm->arch.config_lock in all of these
instances. Opportunistically simplify the locking mechanics of the
PMU configuration by holding the config_lock for the entirety of
kvm_arm_pmu_v3_set_attr().

Note that this also addresses a couple of bugs. There is an unguarded
read of the PMU version in KVM_ARM_VCPU_PMU_V3_FILTER which could race
with KVM_ARM_VCPU_PMU_V3_SET_PMU. Additionally, until now writes to the
per-vCPU vPMU irq were not serialized VM-wide, meaning concurrent calls
to KVM_ARM_VCPU_PMU_V3_IRQ could lead to a false positive in
pmu_irq_is_valid().

Cc: stable@vger.kernel.org
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-4-oliver.upton@linux.dev
15 months agoKVM: arm64: Avoid lock inversion when setting the VM register width
Oliver Upton [Mon, 27 Mar 2023 16:47:45 +0000 (16:47 +0000)]
KVM: arm64: Avoid lock inversion when setting the VM register width

kvm->lock must be taken outside of the vcpu->mutex. Of course, the
locking documentation for KVM makes this abundantly clear. Nonetheless,
the locking order in KVM/arm64 has been wrong for quite a while; we
acquire the kvm->lock while holding the vcpu->mutex all over the shop.

All was seemingly fine until commit 42a90008f890 ("KVM: Ensure lockdep
knows about kvm->lock vs. vcpu->mutex ordering rule") caught us with our
pants down, leading to lockdep barfing:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.2.0-rc7+ #19 Not tainted
 ------------------------------------------------------
 qemu-system-aar/859 is trying to acquire lock:
 ffff5aa69269eba0 (&host_kvm->lock){+.+.}-{3:3}, at: kvm_reset_vcpu+0x34/0x274

 but task is already holding lock:
 ffff5aa68768c0b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8c/0xba0

 which lock already depends on the new lock.

Add a dedicated lock to serialize writes to VM-scoped configuration from
the context of a vCPU. Protect the register width flags with the new
lock, thus avoiding the need to grab the kvm->lock while holding
vcpu->mutex in kvm_reset_vcpu().

Cc: stable@vger.kernel.org
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Link: https://lore.kernel.org/kvmarm/f6452cdd-65ff-34b8-bab0-5c06416da5f6@arm.com/
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-3-oliver.upton@linux.dev
15 months agoKVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON
Oliver Upton [Mon, 27 Mar 2023 16:47:44 +0000 (16:47 +0000)]
KVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON

KVM/arm64 had the lock ordering backwards on vcpu->mutex and kvm->lock
from the very beginning. One such example is the way vCPU resets are
handled: the kvm->lock is acquired while handling a guest CPU_ON PSCI
call.

Add a dedicated lock to serialize writes to kvm_vcpu_arch::{mp_state,
reset_state}. Promote all accessors of mp_state to {READ,WRITE}_ONCE()
as readers do not acquire the mp_state_lock. While at it, plug yet
another race by taking the mp_state_lock in the KVM_SET_MP_STATE ioctl
handler.

As changes to MP state are now guarded with a dedicated lock, drop the
kvm->lock acquisition from the PSCI CPU_ON path. Similarly, move the
reader of reset_state outside of the kvm->lock and instead protect it
with the mp_state_lock. Note that writes to reset_state::reset have been
demoted to regular stores as both readers and writers acquire the
mp_state_lock.

While the kvm->lock inversion still exists in kvm_reset_vcpu(), at least
now PSCI CPU_ON no longer depends on it for serializing vCPU reset.

Cc: stable@vger.kernel.org
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-2-oliver.upton@linux.dev
15 months agoLinux 6.3-rc4
Linus Torvalds [Sun, 26 Mar 2023 21:40:20 +0000 (14:40 -0700)]
Linux 6.3-rc4

15 months agoMerge tag 'usb-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 26 Mar 2023 17:22:44 +0000 (10:22 -0700)]
Merge tag 'usb-6.3-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB / Thunderbolt driver fixes from Greg KH:
 "Here are a small set of USB and Thunderbolt driver fixes for reported
  problems and a documentation update, for 6.3-rc4.

  Included in here are:

   - documentation update for uvc gadget driver

   - small thunderbolt driver fixes

   - cdns3 driver fixes

   - dwc3 driver fixes

   - dwc2 driver fixes

   - chipidea driver fixes

   - typec driver fixes

   - onboard_usb_hub device id updates

   - quirk updates

  All of these have been in linux-next with no reported problems"

* tag 'usb-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (30 commits)
  usb: dwc2: fix a race, don't power off/on phy for dual-role mode
  usb: dwc2: fix a devres leak in hw_enable upon suspend resume
  usb: chipidea: core: fix possible concurrent when switch role
  usb: chipdea: core: fix return -EINVAL if request role is the same with current role
  thunderbolt: Rename shadowed variables bit to interrupt_bit and auto_clear_bit
  thunderbolt: Disable interrupt auto clear for rings
  thunderbolt: Use const qualifier for `ring_interrupt_index`
  usb: gadget: Use correct endianness of the wLength field for WebUSB
  uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS583Gen 2
  usb: cdnsp: changes PCI Device ID to fix conflict with CNDS3 driver
  usb: cdns3: Fix issue with using incorrect PCI device function
  usb: cdnsp: Fixes issue with redundant Status Stage
  MAINTAINERS: make me a reviewer of USB/IP
  thunderbolt: Use scale field when allocating USB3 bandwidth
  thunderbolt: Limit USB3 bandwidth of certain Intel USB4 host routers
  thunderbolt: Call tb_check_quirks() after initializing adapters
  thunderbolt: Add missing UNSET_INBOUND_SBTX for retimer access
  thunderbolt: Fix memory leak in margining
  usb: dwc2: drd: fix inconsistent mode if role-switch-default-mode="host"
  docs: usb: Add documentation for the UVC Gadget
  ...

15 months agoMerge tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 26 Mar 2023 16:18:30 +0000 (09:18 -0700)]
Merge tag 'sched_urgent_for_v6.3_rc4' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Fix a corner case where vruntime of a task is not being sanitized

* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Sanitize vruntime of entity being migrated

15 months agoMerge tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 26 Mar 2023 16:13:35 +0000 (09:13 -0700)]
Merge tag 'perf_urgent_for_v6.3_rc4' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Properly clear perf event status tracking in the AMD perf event
   overflow handler

* tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Always clear status for idx

15 months agoMerge tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 26 Mar 2023 16:06:20 +0000 (09:06 -0700)]
Merge tag 'core_urgent_for_v6.3_rc4' of git://git./linux/kernel/git/tip/tip

Pull core fixes from Borislav Petkov:

 - Do the delayed RCU wakeup for kthreads in the proper order so that
   former doesn't get ignored

 - A noinstr warning fix

* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
  entry: Fix noinstr warning in __enter_from_user_mode()

15 months agoMerge tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 26 Mar 2023 16:01:24 +0000 (09:01 -0700)]
Merge tag 'x86_urgent_for_v6.3_rc4' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a AMX ptrace self test

 - Prevent a false-positive warning when retrieving the (invalid)
   address of dynamic FPU features in their init state which are not
   saved in init_fpstate at all

 - Randomize per-CPU entry areas only when KASLR is enabled

* tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/x86/amx: Add a ptrace test
  x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf()
  x86/mm: Do not shuffle CPU entry areas without KASLR

15 months agoMerge tag 'smb3-client-fixes-6.3-rc3' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 26 Mar 2023 15:56:09 +0000 (08:56 -0700)]
Merge tag 'smb3-client-fixes-6.3-rc3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Twelve cifs/smb3 client fixes (most also for stable)

   - forced umount fix

   - fix for two perf regressions

   - reconnect fixes

   - small debugging improvements

   - multichannel fixes"

* tag 'smb3-client-fixes-6.3-rc3' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix unusable share after force unmount failure
  cifs: fix dentry lookups in directory handle cache
  smb3: lower default deferred close timeout to address perf regression
  cifs: fix missing unload_nls() in smb2_reconnect()
  cifs: avoid race conditions with parallel reconnects
  cifs: append path to open_enter trace event
  cifs: print session id while listing open files
  cifs: dump pending mids for all channels in DebugData
  cifs: empty interface list when server doesn't support query interfaces
  cifs: do not poll server interfaces too regularly
  cifs: lock chan_lock outside match_session
  cifs: check only tcon status on tcon related functions

15 months agoMerge tag 'nfsd-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Sat, 25 Mar 2023 20:32:43 +0000 (13:32 -0700)]
Merge tag 'nfsd-6.3-4' of git://git./linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a crash when using NFS with krb5p

* tag 'nfsd-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix a crash in gss_krb5_checksum()

15 months agoMerge tag 'xfs-6.3-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Sat, 25 Mar 2023 20:12:36 +0000 (13:12 -0700)]
Merge tag 'xfs-6.3-fixes-7' of git://git./fs/xfs/xfs-linux

Pull yet more xfs bug fixes from Darrick Wong:
 "The first bugfix addresses a longstanding problem where we use the
  wrong file mapping cursors when trying to compute the speculative
  preallocation quantity. This has been causing sporadic crashes when
  alwayscow mode is engaged.

  The other two fixes correct minor problems in more recent changes.

   - Fix the new allocator tracepoints because git am mismerged the
     changes such that the trace_XXX got rebased to be in function YYY
     instead of XXX

   - Ensure that the perag AGFL_RESET state is consistent with whatever
     we've just read off the disk

   - Fix a bug where we used the wrong iext cursor during a write begin"

* tag 'xfs-6.3-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix mismerged tracepoints
  xfs: clear incore AGFL_RESET state if it's not needed
  xfs: pass the correct cursor to xfs_iomap_prealloc_size