platform/kernel/linux-starfive.git
18 months agoMerge branch 'kvm-late-6.1' into HEAD
Paolo Bonzini [Thu, 29 Dec 2022 20:35:48 +0000 (15:35 -0500)]
Merge branch 'kvm-late-6.1' into HEAD

x86:

* Change tdp_mmu to a read-only parameter

* Separate TDP and shadow MMU page fault paths

* Enable Hyper-V invariant TSC control

selftests:

* Use TAP interface for kvm_binary_stats_test and tsc_msrs_test

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Test Hyper-V invariant TSC control
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:49 +0000 (11:58 +0200)]
KVM: selftests: Test Hyper-V invariant TSC control

Add a test for the newly introduced Hyper-V invariant TSC control feature:
- HV_X64_MSR_TSC_INVARIANT_CONTROL is not available without
 HV_ACCESS_TSC_INVARIANT CPUID bit set and available with it.
- BIT(0) of HV_X64_MSR_TSC_INVARIANT_CONTROL controls the filtering of
architectural invariant TSC (CPUID.80000007H:EDX[8]) bit.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Test that values written to Hyper-V MSRs are preserved
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:48 +0000 (11:58 +0200)]
KVM: selftests: Test that values written to Hyper-V MSRs are preserved

Enhance 'hyperv_features' selftest by adding a check that KVM
preserves values written to PV MSRs. Two MSRs are, however, 'special':
- HV_X64_MSR_EOI as it is a 'write-only' MSR,
- HV_X64_MSR_RESET as it always reads as '0'.
The later doesn't require any special handling right now because the
test never writes anything besides '0' to the MSR, leave a TODO node
about the fact.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Convert hyperv_features test to using KVM_X86_CPU_FEATURE()
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:47 +0000 (11:58 +0200)]
KVM: selftests: Convert hyperv_features test to using KVM_X86_CPU_FEATURE()

hyperv_features test needs to set certain CPUID bits in Hyper-V feature
leaves but instead of open coding this, common KVM_X86_CPU_FEATURE()
infrastructure can be used.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Rename 'msr->available' to 'msr->fault_exepected' in hyperv_features...
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:46 +0000 (11:58 +0200)]
KVM: selftests: Rename 'msr->available' to 'msr->fault_exepected' in hyperv_features test

It may not be clear what 'msr->available' means. The test actually
checks that accessing the particular MSR doesn't cause #GP, rename
the variable accordingly.

While on it, use 'true'/'false' instead of '1'/'0' for 'write'/
'fault_expected' as these are boolean.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: Hyper-V invariant TSC control
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:45 +0000 (11:58 +0200)]
KVM: x86: Hyper-V invariant TSC control

Normally, genuine Hyper-V doesn't expose architectural invariant TSC
(CPUID.80000007H:EDX[8]) to its guests by default. A special PV MSR
(HV_X64_MSR_TSC_INVARIANT_CONTROL, 0x40000118) and corresponding CPUID
feature bit (CPUID.0x40000003.EAX[15]) were introduced. When bit 0 of the
PV MSR is set, invariant TSC bit starts to show up in CPUID. When the
feature is exposed to Hyper-V guests, reenlightenment becomes unneeded.

Add the feature to KVM. Keep CPUID output intact when the feature
wasn't exposed to L1 and implement the required logic for hiding
invariant TSC when the feature was exposed and invariant TSC control
MSR wasn't written to. Copy genuine Hyper-V behavior and forbid to
disable the feature once it was enabled.

For the reference, for linux guests, support for the feature was added
in commit dce7cd62754b ("x86/hyperv: Allow guests to enable InvariantTSC").

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: Add a KVM-only leaf for CPUID_8000_0007_EDX
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:44 +0000 (11:58 +0200)]
KVM: x86: Add a KVM-only leaf for CPUID_8000_0007_EDX

CPUID_8000_0007_EDX may come handy when X86_FEATURE_CONSTANT_TSC
needs to be checked.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agox86/hyperv: Add HV_EXPOSE_INVARIANT_TSC define
Vitaly Kuznetsov [Thu, 13 Oct 2022 09:58:43 +0000 (11:58 +0200)]
x86/hyperv: Add HV_EXPOSE_INVARIANT_TSC define

Avoid open coding BIT(0) of HV_X64_MSR_TSC_INVARIANT_CONTROL by adding
a dedicated define. While there's only one user at this moment, the
upcoming KVM implementation of Hyper-V Invariant TSC feature will need
to use it as well.

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Pivot on "TDP MMU enabled" when handling direct page faults
Sean Christopherson [Wed, 12 Oct 2022 18:16:58 +0000 (18:16 +0000)]
KVM: x86/mmu: Pivot on "TDP MMU enabled" when handling direct page faults

When handling direct page faults, pivot on the TDP MMU being globally
enabled instead of checking if the target MMU is a TDP MMU.  Now that the
TDP MMU is all-or-nothing, if the TDP MMU is enabled, KVM will reach
direct_page_fault() if and only if the MMU is a TDP MMU.  When TDP is
enabled (obviously required for the TDP MMU), only non-nested TDP page
faults reach direct_page_fault(), i.e. nonpaging MMUs are impossible, as
NPT requires paging to be enabled and EPT faults use ept_page_fault().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221012181702.3663607-8-seanjc@google.com>
[Use tdp_mmu_enabled variable. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Pivot on "TDP MMU enabled" to check if active MMU is TDP MMU
Sean Christopherson [Wed, 12 Oct 2022 18:16:59 +0000 (18:16 +0000)]
KVM: x86/mmu: Pivot on "TDP MMU enabled" to check if active MMU is TDP MMU

Simplify and optimize the logic for detecting if the current/active MMU
is a TDP MMU.  If the TDP MMU is globally enabled, then the active MMU is
a TDP MMU if it is direct.  When TDP is enabled, so called nonpaging MMUs
are never used as the only form of shadow paging KVM uses is for nested
TDP, and the active MMU can't be direct in that case.

Rename the helper and take the vCPU instead of an arbitrary MMU, as
nonpaging MMUs can show up in the walk_mmu if L1 is using nested TDP and
L2 has paging disabled.  Taking the vCPU has the added bonus of cleaning
up the callers, all of which check the current MMU but wrap code that
consumes the vCPU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221012181702.3663607-9-seanjc@google.com>
[Use tdp_mmu_enabled variable. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Replace open coded usage of tdp_mmu_page with is_tdp_mmu_page()
Sean Christopherson [Wed, 12 Oct 2022 18:17:00 +0000 (18:17 +0000)]
KVM: x86/mmu: Replace open coded usage of tdp_mmu_page with is_tdp_mmu_page()

Use is_tdp_mmu_page() instead of querying sp->tdp_mmu_page directly so
that all users benefit if KVM ever finds a way to optimize the logic.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221012181702.3663607-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Rename __direct_map() to direct_map()
David Matlack [Wed, 21 Sep 2022 17:35:46 +0000 (10:35 -0700)]
KVM: x86/mmu: Rename __direct_map() to direct_map()

Rename __direct_map() to direct_map() since the leading underscores are
unnecessary. This also makes the page fault handler names more
consistent: kvm_tdp_mmu_page_fault() calls kvm_tdp_mmu_map() and
direct_page_fault() calls direct_map().

Opportunistically make some trivial cleanups to comments that had to be
modified anyway since they mentioned __direct_map(). Specifically, use
"()" when referring to functions, and include kvm_tdp_mmu_map() among
the various callers of disallowed_hugepage_adjust().

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Stop needlessly making MMU pages available for TDP MMU faults
David Matlack [Wed, 21 Sep 2022 17:35:45 +0000 (10:35 -0700)]
KVM: x86/mmu: Stop needlessly making MMU pages available for TDP MMU faults

Stop calling make_mmu_pages_available() when handling TDP MMU faults.
The TDP MMU does not participate in the "available MMU pages" tracking
and limiting so calling this function is unnecessary work when handling
TDP MMU faults.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Split out TDP MMU page fault handling
David Matlack [Wed, 21 Sep 2022 17:35:44 +0000 (10:35 -0700)]
KVM: x86/mmu: Split out TDP MMU page fault handling

Split out the page fault handling for the TDP MMU to a separate
function.  This creates some duplicate code, but makes the TDP MMU fault
handler simpler to read by eliminating branches and will enable future
cleanups by allowing the TDP MMU and non-TDP MMU fault paths to diverge.

Only compile in the TDP MMU fault handler for 64-bit builds since
kvm_tdp_mmu_map() does not exist in 32-bit builds.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Initialize fault.{gfn,slot} earlier for direct MMUs
David Matlack [Wed, 21 Sep 2022 17:35:43 +0000 (10:35 -0700)]
KVM: x86/mmu: Initialize fault.{gfn,slot} earlier for direct MMUs

Move the initialization of fault.{gfn,slot} earlier in the page fault
handling code for fully direct MMUs. This will enable a future commit to
split out TDP MMU page fault handling without needing to duplicate the
initialization of these 2 fields.

Opportunistically take advantage of the fact that fault.gfn is
initialized in kvm_tdp_page_fault() rather than recomputing it from
fault->addr.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn()
David Matlack [Wed, 21 Sep 2022 17:35:42 +0000 (10:35 -0700)]
KVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn()

Handle faults on GFNs that do not have a backing memslot in
kvm_faultin_pfn() and drop handle_abnormal_pfn(). This eliminates
duplicate code in the various page fault handlers.

Opportunistically tweak the comment about handling gfn > host.MAXPHYADDR
to reflect that the effect of returning RET_PF_EMULATE at that point is
to avoid creating an MMIO SPTE for such GFNs.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Avoid memslot lookup during KVM_PFN_ERR_HWPOISON handling
David Matlack [Wed, 21 Sep 2022 17:35:41 +0000 (10:35 -0700)]
KVM: x86/mmu: Avoid memslot lookup during KVM_PFN_ERR_HWPOISON handling

Pass the kvm_page_fault struct down to kvm_handle_error_pfn() to avoid a
memslot lookup when handling KVM_PFN_ERR_HWPOISON. Opportunistically
move the gfn_to_hva_memslot() call and @current down into
kvm_send_hwpoison_signal() to cut down on line lengths.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Handle error PFNs in kvm_faultin_pfn()
David Matlack [Wed, 21 Sep 2022 17:35:40 +0000 (10:35 -0700)]
KVM: x86/mmu: Handle error PFNs in kvm_faultin_pfn()

Handle error PFNs in kvm_faultin_pfn() rather than relying on the caller
to invoke handle_abnormal_pfn() after kvm_faultin_pfn().
Opportunistically rename kvm_handle_bad_page() to kvm_handle_error_pfn()
to make it more consistent with is_error_pfn().

This commit moves KVM closer to being able to drop
handle_abnormal_pfn(), which will reduce the amount of duplicate code in
the various page fault handlers.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()
David Matlack [Wed, 21 Sep 2022 17:35:39 +0000 (10:35 -0700)]
KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()

Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct
kvm_page_fault. The eliminates duplicate code and reduces the amount of
parameters needed for is_page_fault_stale().

Preemptively split out __kvm_faultin_pfn() to a separate function for
use in subsequent commits.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Move TDP MMU VM init/uninit behind tdp_mmu_enabled
David Matlack [Wed, 21 Sep 2022 17:35:38 +0000 (10:35 -0700)]
KVM: x86/mmu: Move TDP MMU VM init/uninit behind tdp_mmu_enabled

Move kvm_mmu_{init,uninit}_tdp_mmu() behind tdp_mmu_enabled. This makes
these functions consistent with the rest of the calls into the TDP MMU
from mmu.c, and which is now possible since tdp_mmu_enabled is only
modified when the x86 vendor module is loaded. i.e. It will never change
during the lifetime of a VM.

This change also enabled removing the stub definitions for 32-bit KVM,
as the compiler will just optimize the calls out like it does for all
the other TDP MMU functions.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Change tdp_mmu to a read-only parameter
David Matlack [Wed, 21 Sep 2022 17:35:37 +0000 (10:35 -0700)]
KVM: x86/mmu: Change tdp_mmu to a read-only parameter

Change tdp_mmu to a read-only parameter and drop the per-vm
tdp_mmu_enabled. For 32-bit KVM, make tdp_mmu_enabled a macro that is
always false so that the compiler can continue omitting cals to the TDP
MMU.

The TDP MMU was introduced in 5.10 and has been enabled by default since
5.15. At this point there are no known functionality gaps between the
TDP MMU and the shadow MMU, and the TDP MMU uses less memory and scales
better with the number of vCPUs. In other words, there is no good reason
to disable the TDP MMU on a live system.

Purposely do not drop tdp_mmu=N support (i.e. do not force 64-bit KVM to
always use the TDP MMU) since tdp_mmu=N is still used to get test
coverage of KVM's shadow MMU TDP support, which is used in 32-bit KVM.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: x86: Use TAP interface in the tsc_msrs_test
Thomas Huth [Tue, 4 Oct 2022 09:31:31 +0000 (11:31 +0200)]
KVM: selftests: x86: Use TAP interface in the tsc_msrs_test

Let's add some output here so that the user has some feedback
about what is being run.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20221004093131.40392-4-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Use TAP interface in the kvm_binary_stats_test
Thomas Huth [Tue, 4 Oct 2022 09:31:29 +0000 (11:31 +0200)]
KVM: selftests: Use TAP interface in the kvm_binary_stats_test

The kvm_binary_stats_test test currently does not have any output (unless
one of the TEST_ASSERT statement fails), so it's hard to say for a user
how far it did proceed already. Thus let's make this a little bit more
user-friendly and include some TAP output via the kselftest.h interface.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Message-Id: <20221004093131.40392-2-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agokvm: x86/mmu: Warn on linking when sp->unsync_children
Lai Jiangshan [Mon, 12 Dec 2022 09:01:06 +0000 (17:01 +0800)]
kvm: x86/mmu: Warn on linking when sp->unsync_children

Since the commit 65855ed8b034 ("KVM: X86: Synchronize the shadow
pagetable before link it"), no sp would be linked with
sp->unsync_children = 1.

So make it WARN if it is the case.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221212090106.378206-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: VMX: Resurrect vmcs_conf sanitization for KVM-on-Hyper-V
Vitaly Kuznetsov [Fri, 4 Nov 2022 14:47:08 +0000 (15:47 +0100)]
KVM: VMX: Resurrect vmcs_conf sanitization for KVM-on-Hyper-V

Commit 9bcb90650e31 ("KVM: VMX: Get rid of eVMCS specific VMX controls
sanitization") dropped 'vmcs_conf' sanitization for KVM-on-Hyper-V because
there's no known Hyper-V version which would expose a feature
unsupported in eVMCS in VMX feature MSRs. This works well for all
currently existing Hyper-V version, however, future Hyper-V versions
may add features which are supported by KVM and are currently missing
in eVMCSv1 definition (e.g. APIC virtualization, PML,...). When this
happens, existing KVMs will get broken. With the inverted 'unsupported
by eVMCSv1' checks, we can resurrect vmcs_conf sanitization and make
KVM future proof.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Prepare to sanitize tertiary execution controls with eVMCS
Vitaly Kuznetsov [Fri, 4 Nov 2022 14:47:07 +0000 (15:47 +0100)]
KVM: nVMX: Prepare to sanitize tertiary execution controls with eVMCS

In preparation to restoring vmcs_conf sanitization for KVM-on-Hyper-V,
(and for completeness) add tertiary VM-execution controls to
'evmcs_supported_ctrls'.

No functional change intended as KVM doesn't yet expose
MSR_IA32_VMX_PROCBASED_CTLS3 to its guests.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Invert 'unsupported by eVMCSv1' check
Vitaly Kuznetsov [Fri, 4 Nov 2022 14:47:06 +0000 (15:47 +0100)]
KVM: nVMX: Invert 'unsupported by eVMCSv1' check

When a new feature gets implemented in KVM, EVMCS1_UNSUPPORTED_* defines
need to be adjusted to avoid the situation when the feature is exposed
to the guest but there's no corresponding eVMCS field[s] for it. This
is not obvious and fragile. Invert 'unsupported by eVMCSv1' check and
make it 'supported by eVMCSv1' instead, this way it's much harder to
make a mistake. New features will get added to EVMCS1_SUPPORTED_*
defines when the corresponding fields are added to eVMCS definition.

No functional change intended. EVMCS1_SUPPORTED_* defines are composed
by taking KVM_{REQUIRED,OPTIONAL}_VMX_ defines and filtering out what
was previously known as EVMCS1_UNSUPPORTED_*.

From all the controls, SECONDARY_EXEC_TSC_SCALING requires special
handling as it's actually present in eVMCSv1 definition but is not
currently supported for Hyper-V-on-KVM, just for KVM-on-Hyper-V. As
evmcs_supported_ctrls will be used for both scenarios, just add it
there instead of EVMCS1_SUPPORTED_2NDEXEC.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Sanitize primary processor-based VM-execution controls with eVMCS too
Vitaly Kuznetsov [Fri, 4 Nov 2022 14:47:05 +0000 (15:47 +0100)]
KVM: nVMX: Sanitize primary processor-based VM-execution controls with eVMCS too

The only unsupported primary processor-based VM-execution control at the
moment is CPU_BASED_ACTIVATE_TERTIARY_CONTROLS and KVM doesn't expose it
in nested VMX feature MSRs anyway (see nested_vmx_setup_ctls_msrs())
but in preparation to inverting "unsupported with eVMCS" checks (and
for completeness) it's better to sanitize MSR_IA32_VMX_PROCBASED_CTLS/
MSR_IA32_VMX_TRUE_PROCBASED_CTLS too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: restore special vmmcall code layout needed by the harness
Paolo Bonzini [Wed, 30 Nov 2022 18:11:47 +0000 (13:11 -0500)]
KVM: selftests: restore special vmmcall code layout needed by the harness

Commit 8fda37cf3d41 ("KVM: selftests: Stuff RAX/RCX with 'safe' values
in vmmcall()/vmcall()", 2022-11-21) broke the svm_nested_soft_inject_test
because it placed a "pop rbp" instruction after vmmcall.  While this is
correct and mimics what is done in the VMX case, this particular test
expects a ud2 instruction right after the vmmcall, so that it can skip
over it in the L1 part of the test.

Inline a suitably-modified version of vmmcall() to restore the
functionality of the test.

Fixes: 8fda37cf3d41 ("KVM: selftests: Stuff RAX/RCX with 'safe' values in vmmcall()/vmcall()"
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221130181147.9911-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoDocumentation: kvm: clarify SRCU locking order
Paolo Bonzini [Wed, 28 Dec 2022 11:00:22 +0000 (06:00 -0500)]
Documentation: kvm: clarify SRCU locking order

Currently only the locking order of SRCU vs kvm->slots_arch_lock
and kvm->slots_lock is documented.  Extend this to kvm->lock
since Xen emulation got it terribly wrong.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET
Paolo Bonzini [Wed, 28 Dec 2022 10:33:41 +0000 (05:33 -0500)]
KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET

While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running,
if that happened it could cause a deadlock.  This is due to
kvm_xen_eventfd_reset() doing a synchronize_srcu() inside
a kvm->lock critical section.

To avoid this, first collect all the evtchnfd objects in an
array and free all of them once the kvm->lock critical section
is over and th SRCU grace period has expired.

Reported-by: Michal Luczaj <mhal@rbox.co>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Documentation updates and clarifications
David Woodhouse [Mon, 26 Dec 2022 12:03:20 +0000 (12:03 +0000)]
KVM: x86/xen: Documentation updates and clarifications

Most notably, the KVM_XEN_EVTCHN_RESET feature had escaped documentation
entirely. Along with how to turn most stuff off on SHUTDOWN_soft_reset.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi
David Woodhouse [Mon, 26 Dec 2022 12:03:19 +0000 (12:03 +0000)]
KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi

These are (uint64_t)-1 magic values are a userspace ABI, allowing the
shared info pages and other enlightenments to be disabled. This isn't
a Xen ABI because Xen doesn't let the guest turn these off except with
the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is
expected to handle soft reset, and tear down the kernel parts of the
enlightenments accordingly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Simplify eventfd IOCTLs
Michal Luczaj [Mon, 26 Dec 2022 12:03:18 +0000 (12:03 +0000)]
KVM: x86/xen: Simplify eventfd IOCTLs

Port number is validated in kvm_xen_setattr_evtchn().
Remove superfluous checks in kvm_xen_eventfd_assign() and
kvm_xen_eventfd_update().

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20221222203021.1944101-3-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports
Paolo Bonzini [Mon, 26 Dec 2022 12:03:17 +0000 (12:03 +0000)]
KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports

The evtchnfd structure itself must be protected by either kvm->lock or
SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is
being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and
does not need kvm->lock, and is called in SRCU critical section from the
kvm_x86_handle_exit function.

It is also important to use rcu_read_{lock,unlock}() in
kvm_xen_hcall_evtchn_send(), because idr_remove() will *not*
use synchronize_srcu() to wait for readers to complete.

Remove a superfluous if (kvm) check before calling synchronize_srcu()
in kvm_xen_eventfd_deassign() where kvm has been dereferenced already.

Co-developed-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly
David Woodhouse [Mon, 26 Dec 2022 12:03:16 +0000 (12:03 +0000)]
KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly

In particular, we shouldn't assume that being contiguous in guest virtual
address space means being contiguous in guest *physical* address space.

In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop
the srcu_read_lock() that was around them. All call sites are reached
from kvm_xen_hypercall() which is called from the handle_exit function
with the read lock already held.

       536395260 ("KVM: x86/xen: handle PV timers oneshot mode")
       1a65105a5 ("KVM: x86/xen: handle PV spinlocks slowpath")

Fixes: 2fd6df2f2 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()
Michal Luczaj [Mon, 26 Dec 2022 12:03:15 +0000 (12:03 +0000)]
KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()

Release page irrespectively of kvm_vcpu_write_guest() return value.

Suggested-by: Paul Durrant <paul@xen.org>
Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20221220151454.712165-1-mhal@rbox.co>
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-1-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: Delete extra block of "};" in the KVM API documentation
Sean Christopherson [Wed, 7 Dec 2022 00:36:37 +0000 (00:36 +0000)]
KVM: Delete extra block of "};" in the KVM API documentation

Delete an extra block of code/documentation that snuck in when KVM's
documentation was converted to ReST format.

Fixes: 106ee47dc633 ("docs: kvm: Convert api.txt to ReST format")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221207003637.2041211-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agokvm: x86/mmu: Remove duplicated "be split" in spte.h
Lai Jiangshan [Wed, 7 Dec 2022 12:05:05 +0000 (20:05 +0800)]
kvm: x86/mmu: Remove duplicated "be split" in spte.h

"be split be split" -> "be split"

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221207120505.9175-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agokvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()
Lai Jiangshan [Wed, 7 Dec 2022 12:06:16 +0000 (20:06 +0800)]
kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()

No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK().  They
used to be in virt/kvm/pfncache.c:

                KVM_MMU_READ_LOCK(kvm);
                retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
                KVM_MMU_READ_UNLOCK(kvm);

However, since 58cd407ca4c6 ("KVM: Fix multiple races in gfn=>pfn cache
refresh", 2022-05-25) the code is only relying on the MMU notifier's
invalidation count and sequence number.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoMAINTAINERS: adjust entry after renaming the vmx hyperv files
Lukas Bulwahn [Mon, 5 Dec 2022 08:20:44 +0000 (09:20 +0100)]
MAINTAINERS: adjust entry after renaming the vmx hyperv files

Commit a789aeba4196 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to
"vmx/hyperv.{ch}"") renames the VMX specific Hyper-V files, but does not
adjust the entry in MAINTAINERS.

Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a
broken reference.

Repair this file reference in KVM X86 HYPER-V (KVM/hyper-v).

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: a789aeba4196 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to "vmx/hyperv.{ch}"")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221205082044.10141-1-lukas.bulwahn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Mark correct page as mapped in virt_map()
Oliver Upton [Fri, 9 Dec 2022 01:53:02 +0000 (01:53 +0000)]
KVM: selftests: Mark correct page as mapped in virt_map()

The loop marks vaddr as mapped after incrementing it by page size,
thereby marking the *next* page as mapped. Set the bit in vpages_mapped
first instead.

Fixes: 56fc7732031d ("KVM: selftests: Fill in vm->vpages_mapped bitmap in virt_map() too")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: arm64: selftests: Don't identity map the ucall MMIO hole
Oliver Upton [Fri, 9 Dec 2022 01:53:04 +0000 (01:53 +0000)]
KVM: arm64: selftests: Don't identity map the ucall MMIO hole

Currently the ucall MMIO hole is placed immediately after slot0, which
is a relatively safe address in the PA space. However, it is possible
that the same address has already been used for something else (like the
guest program image) in the VA space. At least in my own testing,
building the vgic_irq test with clang leads to the MMIO hole appearing
underneath gicv3_ops.

Stop identity mapping the MMIO hole and instead find an unused VA to map
to it. Yet another subtle detail of the KVM selftests library is that
virt_pg_map() does not update vm->vpages_mapped. Switch over to
virt_map() instead to guarantee that the chosen VA isn't to something
else.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: document the default implementation of vm_vaddr_populate_bitmap
Paolo Bonzini [Mon, 12 Dec 2022 10:36:53 +0000 (05:36 -0500)]
KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap

Explain the meaning of the bit manipulations of vm_vaddr_populate_bitmap.
These correspond to the "canonical addresses" of x86 and other
architectures, but that is not obvious.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Use magic value to signal ucall_alloc() failure
Sean Christopherson [Fri, 9 Dec 2022 20:55:44 +0000 (12:55 -0800)]
KVM: selftests: Use magic value to signal ucall_alloc() failure

Use a magic value to signal a ucall_alloc() failure instead of simply
doing GUEST_ASSERT().  GUEST_ASSERT() relies on ucall_alloc() and so a
failure puts the guest into an infinite loop.

Use -1 as the magic value, as a real ucall struct should never wrap.

Reported-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning
Sean Christopherson [Tue, 13 Dec 2022 00:16:50 +0000 (00:16 +0000)]
KVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning

Disable gnu-variable-sized-type-not-at-end so that tests and libraries
can create overlays of variable sized arrays at the end of structs when
using a fixed number of entries, e.g. to get/set a single MSR.

It's possible to fudge around the warning, e.g. by defining a custom
struct that hardcodes the number of entries, but that is a burden for
both developers and readers of the code.

lib/x86_64/processor.c:664:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
lib/x86_64/processor.c:772:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
lib/x86_64/processor.c:787:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
3 warnings generated.

x86_64/hyperv_tlb_flush.c:54:18: warning: field 'hv_vp_set' with variable sized type 'struct hv_vpset'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
        struct hv_vpset hv_vp_set;
                        ^
1 warning generated.

x86_64/xen_shinfo_test.c:137:25: warning: field 'info' with variable sized type 'struct kvm_irq_routing'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
        struct kvm_irq_routing info;
                               ^
1 warning generated.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Include lib.mk before consuming $(CC)
Sean Christopherson [Tue, 13 Dec 2022 00:16:49 +0000 (00:16 +0000)]
KVM: selftests: Include lib.mk before consuming $(CC)

Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which makes the environment override the Makefile).  Including lib.mk
after using it for probing, e.g. for -no-pie, can lead to weirdness.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Explicitly disable builtins for mem*() overrides
Sean Christopherson [Tue, 13 Dec 2022 00:16:48 +0000 (00:16 +0000)]
KVM: selftests: Explicitly disable builtins for mem*() overrides

Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset().  Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.

  /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc':
  (.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o:
  tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
  clang: error: linker command failed with exit code 1 (use -v to see invocation)

Fixes: 6b6f71484bf4 ("KVM: selftests: Implement memcmp(), memcpy(), and memset() for guest use")
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Probe -no-pie with actual CFLAGS used to compile
Sean Christopherson [Tue, 13 Dec 2022 00:16:47 +0000 (00:16 +0000)]
KVM: selftests: Probe -no-pie with actual CFLAGS used to compile

Probe -no-pie with the actual set of CFLAGS used to compile the tests,
clang whines about -no-pie being unused if the tests are compiled with
-static.

  clang: warning: argument unused during compilation: '-no-pie'
  [-Wunused-command-line-argument]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Use proper function prototypes in probing code
Sean Christopherson [Tue, 13 Dec 2022 00:16:46 +0000 (00:16 +0000)]
KVM: selftests: Use proper function prototypes in probing code

Make the main() functions in the probing code proper prototypes so that
compiling the probing code with more strict flags won't generate false
negatives.

  <stdin>:1:5: error: function declaration isn’t a prototype [-Werror=strict-prototypes]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-8-seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86
Sean Christopherson [Tue, 13 Dec 2022 00:16:45 +0000 (00:16 +0000)]
KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86

Rename UNAME_M to ARCH_DIR and explicitly set it directly for x86.  At
this point, the name of the arch directory really doesn't have anything
to do with `uname -m`, and UNAME_M is unnecessarily confusing given that
its purpose is purely to identify the arch specific directory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Fix a typo in x86-64's kvm_get_cpu_address_width()
Sean Christopherson [Tue, 13 Dec 2022 00:16:44 +0000 (00:16 +0000)]
KVM: selftests: Fix a typo in x86-64's kvm_get_cpu_address_width()

Fix a == vs. = typo in kvm_get_cpu_address_width() that results in
@pa_bits being left unset if the CPU doesn't support enumerating its
MAX_PHY_ADDR.  Flagged by clang's unusued-value warning.

lib/x86_64/processor.c:1034:51: warning: expression result unused [-Wunused-value]
                *pa_bits == kvm_cpu_has(X86_FEATURE_PAE) ? 36 : 32;

Fixes: 3bd396353d18 ("KVM: selftests: Add X86_FEATURE_PAE and use it calc "fallback" MAXPHYADDR")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Use pattern matching in .gitignore
Sean Christopherson [Tue, 13 Dec 2022 00:16:43 +0000 (00:16 +0000)]
KVM: selftests: Use pattern matching in .gitignore

Use pattern matching to exclude everything except .c, .h, .S, and .sh
files from Git.  Manually adding every test target has an absurd
maintenance cost, is comically error prone, and leads to bikeshedding
over whether or not the targets should be listed in alphabetical order.

Deliberately do not include the one-off assets, e.g. config, settings,
.gitignore itself, etc as Git doesn't ignore files that are already in
the repository.  Adding the one-off assets won't prevent mistakes where
developers forget to --force add files that don't match the "allowed".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Fix divide-by-zero bug in memslot_perf_test
Sean Christopherson [Tue, 13 Dec 2022 00:16:42 +0000 (00:16 +0000)]
KVM: selftests: Fix divide-by-zero bug in memslot_perf_test

Check that the number of pages per slot is non-zero in get_max_slots()
prior to computing the remaining number of pages.  clang generates code
that uses an actual DIV for calculating the remaining, which causes a #DE
if the total number of pages is less than the number of slots.

  traps: memslot_perf_te[97611] trap divide error ip:4030c4 sp:7ffd18ae58f0
         error:0 in memslot_perf_test[401000+cb000]

Fixes: a69170c65acd ("KVM: selftests: memslot_perf_test: Report optimal memory slots")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Delete dead code in x86_64/vmx_tsc_adjust_test.c
Sean Christopherson [Tue, 13 Dec 2022 00:16:41 +0000 (00:16 +0000)]
KVM: selftests: Delete dead code in x86_64/vmx_tsc_adjust_test.c

Delete an unused struct definition in x86_64/vmx_tsc_adjust_test.c.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Define literal to asm constraint in aarch64 as unsigned long
Sean Christopherson [Tue, 13 Dec 2022 00:16:40 +0000 (00:16 +0000)]
KVM: selftests: Define literal to asm constraint in aarch64 as unsigned long

Define a literal '0' asm input constraint to aarch64/page_fault_test's
guest_cas() as an unsigned long to make clang happy.

  tools/testing/selftests/kvm/aarch64/page_fault_test.c:120:16: error:
    value size does not match register size specified by the constraint
    and modifier [-Werror,-Wasm-operand-widths]
                       :: "r" (0), "r" (TEST_DATA), "r" (guest_test_memory));
                               ^
  tools/testing/selftests/kvm/aarch64/page_fault_test.c:119:15: note:
    use constraint modifier "w"
                       "casal %0, %1, [%2]\n"
                              ^~
                              %w0

Fixes: 35c581015712 ("KVM: selftests: aarch64: Add aarch64/page_fault_test")
Cc: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoLinux 6.2-rc1
Linus Torvalds [Sun, 25 Dec 2022 21:41:39 +0000 (13:41 -0800)]
Linux 6.2-rc1

18 months agotreewide: Convert del_timer*() to timer_shutdown*()
Steven Rostedt (Google) [Tue, 20 Dec 2022 18:45:19 +0000 (13:45 -0500)]
treewide: Convert del_timer*() to timer_shutdown*()

Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown".  After a timer is set to this state, then it can no
longer be re-armed.

The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed.  It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.

This was created by using a coccinelle script and the following
commands:

    $ cat timer.cocci
    @@
    expression ptr, slab;
    identifier timer, rfield;
    @@
    (
    -       del_timer(&ptr->timer);
    +       timer_shutdown(&ptr->timer);
    |
    -       del_timer_sync(&ptr->timer);
    +       timer_shutdown_sync(&ptr->timer);
    )
      ... when strict
          when != ptr->timer
    (
            kfree_rcu(ptr, rfield);
    |
            kmem_cache_free(slab, ptr);
    |
            kfree(ptr);
    )

    $ spatch timer.cocci . > /tmp/t.patch
    $ patch -p1 < /tmp/t.patch

Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
18 months agoMerge tag 'spi-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Linus Torvalds [Fri, 23 Dec 2022 22:44:08 +0000 (14:44 -0800)]
Merge tag 'spi-fix-v6.2-rc1' of git://git./linux/kernel/git/broonie/spi

Pull spi fix from Mark Brown:
 "One driver specific change here which handles the case where a SPI
  device for some reason tries to change the bus speed during a message
  on fsl_spi hardware, this should be very unusual"

* tag 'spi-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: fsl_spi: Don't change speed while chipselect is active

18 months agoMerge tag 'regulator-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Dec 2022 22:38:00 +0000 (14:38 -0800)]
Merge tag 'regulator-fix-v6.2-rc1' of git://git./linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "Two core fixes here, one for a long standing race which some Qualcomm
  systems have started triggering with their UFS driver and another
  fixing a problem with supply lookup introduced by the fixes for devm
  related use after free issues that were introduced in this merge
  window"

* tag 'regulator-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: core: fix deadlock on regulator enable
  regulator: core: Fix resolve supply lookup issue

18 months agoMerge tag 'coccinelle-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall...
Linus Torvalds [Fri, 23 Dec 2022 21:56:41 +0000 (13:56 -0800)]
Merge tag 'coccinelle-6.2' of git://git./linux/kernel/git/jlawall/linux

Pull coccicheck update from Julia Lawall:
 "Modernize use of grep in coccicheck:

  Use 'grep -E' instead of 'egrep'"

* tag 'coccinelle-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
  scripts: coccicheck: use "grep -E" instead of "egrep"

18 months agoMerge tag 'hardening-v6.2-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 23 Dec 2022 20:00:24 +0000 (12:00 -0800)]
Merge tag 'hardening-v6.2-rc1-fixes' of git://git./linux/kernel/git/kees/linux

Pull kernel hardening fixes from Kees Cook:

 - Fix CFI failure with KASAN (Sami Tolvanen)

 - Fix LKDTM + CFI under GCC 7 and 8 (Kristina Martsenko)

 - Limit CONFIG_ZERO_CALL_USED_REGS to Clang > 15.0.6 (Nathan
   Chancellor)

 - Ignore "contents" argument in LoadPin's LSM hook handling

 - Fix paste-o in /sys/kernel/warn_count API docs

 - Use READ_ONCE() consistently for oops/warn limit reading

* tag 'hardening-v6.2-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  cfi: Fix CFI failure with KASAN
  exit: Use READ_ONCE() for all oops/warn limit reads
  security: Restrict CONFIG_ZERO_CALL_USED_REGS to gcc or clang > 15.0.6
  lkdtm: cfi: Make PAC test work with GCC 7 and 8
  docs: Fix path paste-o for /sys/kernel/warn_count
  LoadPin: Ignore the "contents" argument of the LSM hooks

18 months agoMerge tag 'pstore-v6.2-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 23 Dec 2022 19:55:54 +0000 (11:55 -0800)]
Merge tag 'pstore-v6.2-rc1-fixes' of git://git./linux/kernel/git/kees/linux

Pull pstore fixes from Kees Cook:

 - Switch pmsg_lock to an rt_mutex to avoid priority inversion (John
   Stultz)

 - Correctly assign mem_type property (Luca Stefani)

* tag 'pstore-v6.2-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Properly assign mem_type property
  pstore: Make sure CONFIG_PSTORE_PMSG selects CONFIG_RT_MUTEXES
  pstore: Switch pmsg_lock to an rt_mutex to avoid priority inversion

18 months agoMerge tag 'dma-mapping-2022-12-23' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Fri, 23 Dec 2022 19:44:20 +0000 (11:44 -0800)]
Merge tag 'dma-mapping-2022-12-23' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix up the sound code to not pass __GFP_COMP to the non-coherent DMA
  allocator, as it copes with that just as badly as the coherent
  allocator, and then add a check to make sure no one passes the flag
  ever again"

* tag 'dma-mapping-2022-12-23' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: reject GFP_COMP for noncoherent allocations
  ALSA: memalloc: don't use GFP_COMP for non-coherent dma allocations

18 months agoMerge tag '9p-for-6.2-rc1' of https://github.com/martinetd/linux
Linus Torvalds [Fri, 23 Dec 2022 19:39:18 +0000 (11:39 -0800)]
Merge tag '9p-for-6.2-rc1' of https://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:

 - improve p9_check_errors to check buffer size instead of msize when
   possible (e.g. not zero-copy)

 - some more syzbot and KCSAN fixes

 - minor headers include cleanup

* tag '9p-for-6.2-rc1' of https://github.com/martinetd/linux:
  9p/client: fix data race on req->status
  net/9p: fix response size check in p9_check_errors()
  net/9p: distinguish zero-copy requests
  9p/xen: do not memcpy header into req->rc
  9p: set req refcount to zero to avoid uninitialized usage
  9p/net: Remove unneeded idr.h #include
  9p/fs: Remove unneeded idr.h #include

18 months agoMerge tag 'sound-6.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 23 Dec 2022 19:15:48 +0000 (11:15 -0800)]
Merge tag 'sound-6.2-rc1-2' of git://git./linux/kernel/git/tiwai/sound

Pull more sound updates from Takashi Iwai:
 "A few more updates for 6.2: most of changes are about ASoC
  device-specific fixes.

   - Lots of ASoC Intel AVS extensions and refactoring

   - Quirks for ASoC Intel SOF as well as regression fixes

   - ASoC Mediatek and Rockchip fixes

   - Intel HD-audio HDMI workarounds

   - Usual HD- and USB-audio device-specific quirks"

* tag 'sound-6.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (54 commits)
  ALSA: usb-audio: Add new quirk FIXED_RATE for JBL Quantum810 Wireless
  ALSA: azt3328: Remove the unused function snd_azf3328_codec_outl()
  ASoC: lochnagar: Fix unused lochnagar_of_match warning
  ASoC: Intel: Add HP Stream 8 to bytcr_rt5640.c
  ASoC: SOF: mediatek: initialize panic_info to zero
  ASoC: rt5670: Remove unbalanced pm_runtime_put()
  ASoC: Intel: bytcr_rt5640: Add quirk for the Advantech MICA-071 tablet
  ASoC: Intel: soc-acpi: update codec addr on 0C11/0C4F product
  ASoC: rockchip: spdif: Add missing clk_disable_unprepare() in rk_spdif_runtime_resume()
  ASoC: wm8994: Fix potential deadlock
  ASoC: mediatek: mt8195: add sof be ops to check audio active
  ASoC: SOF: Revert: "core: unregister clients and machine drivers in .shutdown"
  ASoC: SOF: Intel: pci-tgl: unblock S5 entry if DMA stop has failed"
  ALSA: hda/hdmi: fix stream-id config keep-alive for rt suspend
  ALSA: hda/hdmi: set default audio parameters for KAE silent-stream
  ALSA: hda/hdmi: fix i915 silent stream programming flow
  ALSA: hda: Error out if invalid stream is being setup
  ASoC: dt-bindings: fsl-sai: Reinstate i.MX93 SAI compatible string
  ASoC: soc-pcm.c: Clear DAIs parameters after stream_active is updated
  ASoC: codecs: wcd-clsh: Remove the unused function
  ...

18 months agoMerge tag 'drm-next-2022-12-23' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 23 Dec 2022 19:09:44 +0000 (11:09 -0800)]
Merge tag 'drm-next-2022-12-23' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Holiday fixes!

  Two batches from amd, and one group of i915 changes.

  amdgpu:
   - Spelling fix
   - BO pin fix
   - Properly handle polaris 10/11 overlap asics
   - GMC9 fix
   - SR-IOV suspend fix
   - DCN 3.1.4 fix
   - KFD userptr locking fix
   - SMU13.x fixes
   - GDS/GWS/OA handling fix
   - Reserved VMID handling fixes
   - FRU EEPROM fix
   - BO validation fixes
   - Avoid large variable on the stack
   - S0ix fixes
   - SMU 13.x fixes
   - VCN fix
   - Add missing fence reference

  amdkfd:
   - Fix init vm error handling
   - Fix double release of compute pasid

  i915
   - Documentation fixes
   - OA-perf related fix
   - VLV/CHV HDMI/DP audio fix
   - Display DDI/Transcoder fix
   - Migrate fixes"

* tag 'drm-next-2022-12-23' of git://anongit.freedesktop.org/drm/drm: (39 commits)
  drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency
  drm/amdgpu: enable VCN DPG for GC IP v11.0.4
  drm/amdgpu: skip mes self test after s0i3 resume for MES IP v11.0
  drm/amd/pm: correct the fan speed retrieving in PWM for some SMU13 asics
  drm/amd/pm: bump SMU13.0.0 driver_if header to version 0x34
  drm/amdgpu: skip MES for S0ix as well since it's part of GFX
  drm/amd/pm: avoid large variable on kernel stack
  drm/amdkfd: Fix double release compute pasid
  drm/amdkfd: Fix kfd_process_device_init_vm error handling
  drm/amd/pm: update SMU13.0.0 reported maximum shader clock
  drm/amd/pm: correct SMU13.0.0 pstate profiling clock settings
  drm/amd/pm: enable GPO dynamic control support for SMU13.0.7
  drm/amd/pm: enable GPO dynamic control support for SMU13.0.0
  drm/amdgpu: revert "generally allow over-commit during BO allocation"
  drm/amdgpu: Remove unnecessary domain argument
  drm/amdgpu: Fix size validation for non-exclusive domains (v4)
  drm/amdgpu: Check if fru_addr is not NULL (v2)
  drm/i915/ttm: consider CCS for backup objects
  drm/i915/migrate: fix corner case in CCS aux copying
  drm/amdgpu: rework reserved VMID handling
  ...

18 months agoMerge tag 'mips_6.2_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Linus Torvalds [Fri, 23 Dec 2022 18:49:45 +0000 (10:49 -0800)]
Merge tag 'mips_6.2_1' of git://git./linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:
 "Fixes due to DT changes"

* tag 'mips_6.2_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: dts: bcm63268: Add missing properties to the TWD node
  MIPS: ralink: mt7621: avoid to init common ralink reset controller

18 months agoMerge tag 'mm-hotfixes-stable-2022-12-22-14-34' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 23 Dec 2022 18:45:00 +0000 (10:45 -0800)]
Merge tag 'mm-hotfixes-stable-2022-12-22-14-34' of git://git./linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Eight fixes, all cc:stable. One is for gcov and the remainder are MM"

* tag 'mm-hotfixes-stable-2022-12-22-14-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  gcov: add support for checksum field
  test_maple_tree: add test for mas_spanning_rebalance() on insufficient data
  maple_tree: fix mas_spanning_rebalance() on insufficient data
  hugetlb: really allocate vma lock for all sharable vmas
  kmsan: export kmsan_handle_urb
  kmsan: include linux/vmalloc.h
  mm/mempolicy: fix memory leak in set_mempolicy_home_node system call
  mm, mremap: fix mremap() expanding vma with addr inside vma

18 months agopstore: Properly assign mem_type property
Luca Stefani [Thu, 22 Dec 2022 13:10:49 +0000 (14:10 +0100)]
pstore: Properly assign mem_type property

If mem-type is specified in the device tree
it would end up overriding the record_size
field instead of populating mem_type.

As record_size is currently parsed after the
improper assignment with default size 0 it
continued to work as expected regardless of the
value found in the device tree.

Simply changing the target field of the struct
is enough to get mem-type working as expected.

Fixes: 9d843e8fafc7 ("pstore: Add mem_type property DT parsing support")
Cc: stable@vger.kernel.org
Signed-off-by: Luca Stefani <luca@osomprivacy.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221222131049.286288-1-luca@osomprivacy.com
18 months agopstore: Make sure CONFIG_PSTORE_PMSG selects CONFIG_RT_MUTEXES
John Stultz [Wed, 21 Dec 2022 05:18:55 +0000 (05:18 +0000)]
pstore: Make sure CONFIG_PSTORE_PMSG selects CONFIG_RT_MUTEXES

In commit 76d62f24db07 ("pstore: Switch pmsg_lock to an rt_mutex
to avoid priority inversion") I changed a lock to an rt_mutex.

However, its possible that CONFIG_RT_MUTEXES is not enabled,
which then results in a build failure, as the 0day bot detected:
  https://lore.kernel.org/linux-mm/202212211244.TwzWZD3H-lkp@intel.com/

Thus this patch changes CONFIG_PSTORE_PMSG to select
CONFIG_RT_MUTEXES, which ensures the build will not fail.

Cc: Wei Wang <wvw@google.com>
Cc: Midas Chien<midaschieh@google.com>
Cc: Connor O'Brien <connoro@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: kernel test robot <lkp@intel.com>
Cc: kernel-team@android.com
Fixes: 76d62f24db07 ("pstore: Switch pmsg_lock to an rt_mutex to avoid priority inversion")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221221051855.15761-1-jstultz@google.com
18 months agocfi: Fix CFI failure with KASAN
Sami Tolvanen [Thu, 22 Dec 2022 22:57:47 +0000 (22:57 +0000)]
cfi: Fix CFI failure with KASAN

When CFI_CLANG and KASAN are both enabled, LLVM doesn't generate a
CFI type hash for asan.module_ctor functions in translation units
where CFI is disabled, which leads to a CFI failure during boot when
do_ctors calls the affected constructors:

  CFI failure at do_basic_setup+0x64/0x90 (target:
  asan.module_ctor+0x0/0x28; expected type: 0xa540670c)

Specifically, this happens because CFI is disabled for
kernel/cfi.c. There's no reason to keep CFI disabled here anymore, so
fix the failure by not filtering out CC_FLAGS_CFI for the file.

Note that https://reviews.llvm.org/rG3b14862f0a96 fixed the issue
where LLVM didn't emit CFI type hashes for any sanitizer constructors,
but now type hashes are emitted correctly for TUs that use CFI.

Link: https://github.com/ClangBuiltLinux/linux/issues/1742
Fixes: 89245600941e ("cfi: Switch to -fsanitize=kcfi")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221222225747.3538676-1-samitolvanen@google.com
18 months agoKVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level
Sean Christopherson [Tue, 13 Dec 2022 03:30:29 +0000 (03:30 +0000)]
KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level

Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
match the target level of the fault, and instead have the vCPU retry the
faulting instruction after warning.  Continuing on is completely
unnecessary as the absolute worst case scenario of retrying is DoSing
the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.

  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
  invalid opcode: 0000 [#1] SMP
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
  RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
  RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
  RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
  R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
  R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_mmu_map+0x3b0/0x510
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  Modules linked in: kvm_intel
  ---[ end trace 0000000000000000 ]---

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
Sean Christopherson [Tue, 13 Dec 2022 03:30:28 +0000 (03:30 +0000)]
KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed

Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
when adding a new shadow page in the TDP MMU.  To ensure the NX reclaim
kthread can't see a not-yet-linked shadow page, the page fault path links
the new page table prior to adding the page to possible_nx_huge_pages.

If the page is zapped by different task, e.g. because dirty logging is
disabled, between linking the page and adding it to the list, KVM can end
up triggering use-after-free by adding the zapped SP to the aforementioned
list, as the zapped SP's memory is scheduled for removal via RCU callback.
The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
i.e. the below splat is just one possible signature.

  ------------[ cut here ]------------
  list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
  WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
  Modules linked in: kvm_intel
  CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #71
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__list_add_valid+0x79/0xa0
  RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
  RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
  R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
  R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
  FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   track_possible_nx_huge_page+0x53/0x80
   kvm_tdp_mmu_map+0x242/0x2c0
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
Reported-by: Greg Thelen <gthelen@google.com>
Analyzed-by: David Matlack <dmatlack@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
Sean Christopherson [Tue, 13 Dec 2022 03:30:27 +0000 (03:30 +0000)]
KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached

Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached.  A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs.  However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down.   The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.

Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
  Modules linked in: kvm_intel
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
  RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
  R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
  R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---
  Invalid SPTE change: cannot replace a present leaf
  SPTE with another present leaf SPTE mapping a
  different PFN!
  as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
  invalid opcode: 0000 [#1] SMP
  CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
  RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
  RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
  RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
  RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
  R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
  R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
  FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_mmu_map+0x3b0/0x510
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  Modules linked in: kvm_intel
  ---[ end trace 0000000000000000 ]---

Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen
Sean Christopherson [Tue, 13 Dec 2022 03:30:26 +0000 (03:30 +0000)]
KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozen

Hoist the is_removed_spte() check above the "level == goal_level" check
when walking SPTEs during a TDP MMU page fault to avoid attempting to map
a leaf entry if said entry is frozen by a different task/vCPU.

  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0
  Modules linked in: kvm_intel
  CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0
  RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246
  RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005
  RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000
  RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005
  R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000
  FS:  00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0x10c/0x130
   kvm_mmu_page_fault+0x103/0x680
   vmx_handle_exit+0x132/0x5a0 [kvm_intel]
   vcpu_enter_guest+0x60c/0x16f0
   kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Robert Hoo <robert.hu@linux.intel.com>
Message-Id: <20221213033030.83345-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Don't stuff secondary execution control if it's not supported
Sean Christopherson [Tue, 13 Dec 2022 06:23:04 +0000 (06:23 +0000)]
KVM: nVMX: Don't stuff secondary execution control if it's not supported

When stuffing the allowed secondary execution controls for nested VMX in
response to CPUID updates, don't set the allowed-1 bit for a feature that
isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.

WARN if KVM attempts to manipulate a feature that isn't supported.  All
features that are currently stuffed are always advertised to L1 for
nested VMX if they are supported in KVM's base configuration, and no
additional features should ever be added to the CPUID-induced stuffing
(updating VMX MSRs in response to CPUID updates is a long-standing KVM
flaw that is slowly being fixed).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213062306.667649-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1
Sean Christopherson [Tue, 13 Dec 2022 06:23:03 +0000 (06:23 +0000)]
KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1

Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
feature is supported in hardware and enabled in KVM's base, non-nested
configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
obviously allows L1 to enable the feature for L2.

KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
that the guest value must be a strict subset of the supported host value.

Although no past commit explicitly enabled nested support for WAITPKG,
doing so is safe and functionally correct from an architectural
perspective as no additional KVM support is needed to virtualize TPAUSE,
UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
VM-Exits to L1 as necessary (commit bf653b78f960, "KVM: vmx: Introduce
handle_unexpected_vmexit and handle WAITPKG vmexit").

Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in
hardware, i.e. always runs both L1 and L2 with the host's power management
settings for TPAUSE and UMWAIT.  See commit bf09fb6cba4f ("KVM: VMX: Stop
context switching MSR_IA32_UMWAIT_CONTROL") for more details.

Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions")
Cc: stable@vger.kernel.org
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20221213062306.667649-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberate
Sean Christopherson [Tue, 20 Dec 2022 15:42:24 +0000 (15:42 +0000)]
KVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberate

Explicitly drop the result of kvm_vcpu_write_guest() when writing the
"launch state" as part of VMCLEAR emulation, and add a comment to call
out that KVM's behavior is architecturally valid.  Intel's pseudocode
effectively says that VMCLEAR is a nop if the target VMCS address isn't
in memory, e.g. if the address points at MMIO.

Add a FIXME to call out that suppressing failures on __copy_to_user() is
wrong, as memory (a memslot) does exist in that case.  Punt the issue to
the future as open coding kvm_vcpu_write_guest() just to make sure the
guest dies with -EFAULT isn't worth the extra complexity.  The flaw will
need to be addressed if KVM ever does something intelligent on uaccess
failures, e.g. to support post-copy demand paging, but in that case KVM
will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open
code a core KVM helper.

No functional change intended.

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527765 ("Error handling issues")
Fixes: 587d7e72aedc ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220154224.526568-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: selftests: Zero out valid_bank_mask for "all" case in Hyper-V IPI test
Sean Christopherson [Mon, 19 Dec 2022 22:04:16 +0000 (22:04 +0000)]
KVM: selftests: Zero out valid_bank_mask for "all" case in Hyper-V IPI test

Zero out the valid_bank_mask when using the fast variant of
HVCALL_SEND_IPI_EX to send IPIs to all vCPUs.  KVM requires the "var_cnt"
and "valid_bank_mask" inputs to be consistent even when targeting all
vCPUs.  See commit bd1ba5732bb9 ("KVM: x86: Get the number of Hyper-V
sparse banks from the VARHEAD field").

Fixes: 998489245d84 ("KVM: selftests: Hyper-V PV IPI selftest")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221219220416.395329-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: Sanity check inputs to kvm_handle_memory_failure()
Sean Christopherson [Tue, 20 Dec 2022 15:34:27 +0000 (15:34 +0000)]
KVM: x86: Sanity check inputs to kvm_handle_memory_failure()

Add a sanity check in kvm_handle_memory_failure() to assert that a valid
x86_exception structure is provided if the memory "failure" wants to
propagate a fault into the guest.  If a memory failure happens during a
direct guest physical memory access, e.g. for nested VMX, KVM hardcodes
the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer
(because the exception struct would just be filled with garbage).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220153427.514032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: Simplify kvm_apic_hw_enabled
Peng Hao [Tue, 6 Dec 2022 09:20:15 +0000 (17:20 +0800)]
KVM: x86: Simplify kvm_apic_hw_enabled

kvm_apic_hw_enabled() only needs to return bool, there is no place
to use the return value of MSR_IA32_APICBASE_ENABLE.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: hyper-v: Fix 'using uninitialized value' Coverity warning
Vitaly Kuznetsov [Thu, 8 Dec 2022 10:27:00 +0000 (11:27 +0100)]
KVM: x86: hyper-v: Fix 'using uninitialized value' Coverity warning

In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables
are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM
halves' and increase 'data_offset' otherwise. Coverity discovered, that in
one case both variables are incremented unconditionally. This doesn't seem
to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves'
data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also
takes into account 'hc->fast' but is still worth fixing.

To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to
'struct kvm_hv_hcall' as a union and use at call sites. This allows to
remove explicit 'data_offset'/'consumed_xmm_halves' parameters from
kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries()
helpers.

Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and
is not zeroed, consumers are supposed to initialize the appropriate field
if needed.

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527764 ("Uninitialized variables")
Fixes: 260970862c88 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208102700.959630-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race
Adamos Ttofari [Thu, 8 Dec 2022 09:44:14 +0000 (09:44 +0000)]
KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race

When scanning userspace I/OAPIC entries, intercept EOI for level-triggered
IRQs if the current vCPU has a pending and/or in-service IRQ for the
vector in its local API, even if the vCPU doesn't match the new entry's
destination.  This fixes a race between userspace I/OAPIC reconfiguration
and IRQ delivery that results in the vector's bit being left set in the
remote IRR due to the eventual EOI not being forwarded to the userspace
I/OAPIC.

Commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC
reconfigure race") fixed the in-kernel IOAPIC, but not the userspace
IOAPIC configuration, which has a similar race.

Fixes: 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race")

Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208094415.12723-1-attofari@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoKVM: x86/pmu: Prevent zero period event from being repeatedly released
Like Xu [Wed, 7 Dec 2022 07:15:05 +0000 (15:15 +0800)]
KVM: x86/pmu: Prevent zero period event from being repeatedly released

The current vPMU can reuse the same pmc->perf_event for the same
hardware event via pmc_pause/resume_counter(), but this optimization
does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1,
in_tx_cp=1"), where event->attr.sample_period is legally zero at creation,
thus making the perf call to perf_event_period() meaningless (no need to
adjust sample period in this case), and instead causing such reusable
perf_events to be repeatedly released and created.

Avoid releasing zero sample_period events by checking is_sampling_event()
to follow the previously enable/disable optimization.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20221207071506.15733-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
18 months agoMerge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Thu, 22 Dec 2022 19:22:31 +0000 (11:22 -0800)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi

Pull more SCSI updates from James Bottomley:
 "Mostly small bug fixes and small updates.

  The only things of note is a qla2xxx fix for crash on hotplug and
  timeout and the addition of a user exposed abstraction layer for
  persistent reservation error return handling (which necessitates the
  conversion of nvme.c as well as SCSI)"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix crash when I/O abort times out
  nvme: Convert NVMe errors to PR errors
  scsi: sd: Convert SCSI errors to PR errors
  scsi: core: Rename status_byte to sg_status_byte
  block: Add error codes for common PR failures
  scsi: sd: sd_zbc: Trace zone append emulation
  scsi: libfc: Include the correct header

18 months agoMerge tag 'afs-next-20221222' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowel...
Linus Torvalds [Thu, 22 Dec 2022 19:17:34 +0000 (11:17 -0800)]
Merge tag 'afs-next-20221222' of git://git./linux/kernel/git/dhowells/linux-fs

Pull afs update from David Howells:
 "A fix for a couple of missing resource counter decrements, two small
  cleanups of now-unused bits of code and a patch to remove writepage
  support from afs"

* tag 'afs-next-20221222' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Stop implementing ->writepage()
  afs: remove afs_cache_netfs and afs_zap_permits() declarations
  afs: remove variable nr_servers
  afs: Fix lost servers_outstanding count

18 months agoMerge tag 'perf-tools-for-v6.2-2-2022-12-22' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Thu, 22 Dec 2022 19:07:29 +0000 (11:07 -0800)]
Merge tag 'perf-tools-for-v6.2-2-2022-12-22' of git://git./linux/kernel/git/acme/linux

Pull more perf tools updates from Arnaldo Carvalho de Melo:
 "perf tools fixes and improvements:

   - Don't stop building perf if python setuptools isn't installed, just
     disable the affected perf feature.

   - Remove explicit reference to python 2.x devel files, that warning
     is about python-devel, no matter what version, being unavailable
     and thus disabling the linking with libpython.

   - Don't use -Werror=switch-enum when building the python support that
     handles libtraceevent enumerations, as there is no good way to test
     if some specific enum entry is available with the libtraceevent
     installed on the system.

   - Introduce 'perf lock contention' --type-filter and --lock-filter,
     to filter by lock type and lock name:

        $ sudo ./perf lock record -a -- ./perf bench sched messaging

        $ sudo ./perf lock contention -E 5 -Y spinlock
         contended  total wait   max wait  avg wait      type  caller

               802     1.26 ms   11.73 us   1.58 us  spinlock  __wake_up_common_lock+0x62
                13   787.16 us  105.44 us  60.55 us  spinlock  remove_wait_queue+0x14
                12   612.96 us   78.70 us  51.08 us  spinlock  prepare_to_wait+0x27
               114   340.68 us   12.61 us   2.99 us  spinlock  try_to_wake_up+0x1f5
                83   226.38 us    9.15 us   2.73 us  spinlock  folio_lruvec_lock_irqsave+0x5e

        $ sudo ./perf lock contention -l
         contended  total wait  max wait  avg wait           address  symbol

                57     1.11 ms  42.83 us  19.54 us  ffff9f4140059000
                15   280.88 us  23.51 us  18.73 us  ffffffff9d007a40  jiffies_lock
                 1    20.49 us  20.49 us  20.49 us  ffffffff9d0d50c0  rcu_state
                 1     9.02 us   9.02 us   9.02 us  ffff9f41759e9ba0

        $ sudo ./perf lock contention -L jiffies_lock,rcu_state
         contended  total wait  max wait  avg wait      type  caller

                15   280.88 us  23.51 us  18.73 us  spinlock  tick_sched_do_timer+0x93
                 1    20.49 us  20.49 us  20.49 us  spinlock  __softirqentry_text_start+0xeb

        $ sudo ./perf lock contention -L ffff9f4140059000
         contended  total wait  max wait  avg wait      type  caller

                38   779.40 us  42.83 us  20.51 us  spinlock  worker_thread+0x50
                11   216.30 us  39.87 us  19.66 us  spinlock  queue_work_on+0x39
                 8   118.13 us  20.51 us  14.77 us  spinlock  kthread+0xe5

   - Fix splitting CC into compiler and options when checking if a
     option is present in clang to build the python binding, needed in
     systems such as yocto that set CC to, e.g.: "gcc --sysroot=/a/b/c".

   - Refresh metris and events for Intel systems: alderlake.
     alderlake-n, bonnell, broadwell, broadwellde, broadwellx,
     cascadelakex, elkhartlake, goldmont, goldmontplus, haswell,
     haswellx, icelake, icelakex, ivybridge, ivytown, jaketown,
     knightslanding, meteorlake, nehalemep, nehalemex, sandybridge,
     sapphirerapids, silvermont, skylake, skylakex, snowridgex,
     tigerlake, westmereep-dp, westmereep-sp, westmereex.

   - Add vendor events files (JSON) for AMD Zen 4, from sections
     2.1.15.4 "Core Performance Monitor Counters", 2.1.15.5 "L3 Cache
     Performance Monitor Counter"s and Section 7.1 "Fabric Performance
     Monitor Counter (PMC) Events" in the Processor Programming
     Reference (PPR) for AMD Family 19h Model 11h Revision B1
     processors.

     This constitutes events which capture op dispatch, execution and
     retirement, branch prediction, L1 and L2 cache activity, TLB
     activity, L3 cache activity and data bandwidth for various links
     and interfaces in the Data Fabric.

   - Also, from the same PPR are metrics taken from Section 2.1.15.2
     "Performance Measurement", including pipeline utilization, which
     are new to Zen 4 processors and useful for finding performance
     bottlenecks by analyzing activity at different stages of the
     pipeline.

   - Greatly improve the 'srcline', 'srcline_from', 'srcline_to' and
     'srcfile' sort keys performance by postponing calling the external
     addr2line utility to the collapse phase of histogram bucketing.

   - Fix 'perf test' "all PMU test" to skip parametrized events, that
     requires setting up and are not supported by this test.

   - Update tools/ copies of kernel headers: features,
     disabled-features, fscrypt.h, i915_drm.h, msr-index.h, power pc
     syscall table and kvm.h.

   - Add .DELETE_ON_ERROR special Makefile target to clean up partially
     updated files on error.

   - Simplify the mksyscalltbl script for arm64 by avoiding to run the
     host compiler to create the syscall table, do it all just with the
     shell script.

   - Further fixes to honour quiet mode (-q)"

* tag 'perf-tools-for-v6.2-2-2022-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (67 commits)
  perf python: Fix splitting CC into compiler and options
  perf scripting python: Don't be strict at handling libtraceevent enumerations
  perf arm64: Simplify mksyscalltbl
  perf build: Remove explicit reference to python 2.x devel files
  perf vendor events amd: Add Zen 4 mapping
  perf vendor events amd: Add Zen 4 metrics
  perf vendor events amd: Add Zen 4 uncore events
  perf vendor events amd: Add Zen 4 core events
  perf vendor events intel: Refresh westmereex events
  perf vendor events intel: Refresh westmereep-sp events
  perf vendor events intel: Refresh westmereep-dp events
  perf vendor events intel: Refresh tigerlake metrics and events
  perf vendor events intel: Refresh snowridgex events
  perf vendor events intel: Refresh skylakex metrics and events
  perf vendor events intel: Refresh skylake metrics and events
  perf vendor events intel: Refresh silvermont events
  perf vendor events intel: Refresh sapphirerapids metrics and events
  perf vendor events intel: Refresh sandybridge metrics and events
  perf vendor events intel: Refresh nehalemex events
  perf vendor events intel: Refresh nehalemep events
  ...

18 months agoperf python: Fix splitting CC into compiler and options
Arnaldo Carvalho de Melo [Thu, 22 Dec 2022 13:56:25 +0000 (10:56 -0300)]
perf python: Fix splitting CC into compiler and options

Noticed this build failure on archlinux:base when building with clang:

  clang-14: error: optimization flag '-ffat-lto-objects' is not supported [-Werror,-Wignored-optimization-argument]

In tools/perf/util/setup.py we check if clang supports that option, but
since commit 3cad53a6f9cdbafa ("perf python: Account for multiple words
in CC") this got broken as in the common case where CC="clang":

  >>> cc="clang"
  >>> print(cc.split()[0])
  clang
  >>> option="-ffat-lto-objects"
  >>> print(str(cc.split()[1:]) + option)
  []-ffat-lto-objects
  >>>

And then the Popen will call clang with that bogus option name that in
turn will not produce the b"unknown argument" or b"is not supported"
that this function uses to detect if the option is not available and
thus later on clang will be called with an unknown/unsupported option.

Fix it by looking if really there are options in the provided CC
variable, and if so override 'cc' with the first token and append the
options to the 'option' variable.

Fixes: 3cad53a6f9cdbafa ("perf python: Account for multiple words in CC")
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Keeping <john@metanate.com>
Cc: Khem Raj <raj.khem@gmail.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Link: http://lore.kernel.org/lkml/Y6Rq5F5NI0v1QQHM@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
18 months agoafs: Stop implementing ->writepage()
David Howells [Fri, 18 Nov 2022 07:57:27 +0000 (07:57 +0000)]
afs: Stop implementing ->writepage()

We're trying to get rid of the ->writepage() hook[1].  Stop afs from using
it by unlocking the page and calling afs_writepages_region() rather than
folio_write_one().

A flag is passed to afs_writepages_region() to indicate that it should only
write a single region so that we don't flush the entire file in
->write_begin(), but do add other dirty data to the region being written to
try and reduce the number of RPC ops.

This requires ->migrate_folio() to be implemented, so point that at
filemap_migrate_folio() for files and also for symlinks and directories.

This can be tested by turning on the afs_folio_dirty tracepoint and then
doing something like:

   xfs_io -c "w 2223 7000" -c "w 15000 22222" -c "w 23 7" /afs/my/test/foo

and then looking in the trace to see if the write at position 15000 gets
stored before page 0 gets dirtied for the write at position 23.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20221113162902.883850-1-hch@lst.de/
Link: https://lore.kernel.org/r/166876785552.222254.4403222906022558715.stgit@warthog.procyon.org.uk/
18 months agoafs: remove afs_cache_netfs and afs_zap_permits() declarations
Gaosheng Cui [Fri, 9 Sep 2022 07:03:53 +0000 (15:03 +0800)]
afs: remove afs_cache_netfs and afs_zap_permits() declarations

afs_zap_permits() has been removed since
commit be080a6f43c4 ("afs: Overhaul permit caching").

afs_cache_netfs has been removed since
commit 523d27cda149 ("afs: Convert afs to use the new fscache API").

so remove the declare for them from header file.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20220909070353.1160228-1-cuigaosheng1@huawei.com/
18 months agoafs: remove variable nr_servers
Colin Ian King [Thu, 20 Oct 2022 17:39:23 +0000 (18:39 +0100)]
afs: remove variable nr_servers

Variable nr_servers is no longer being used, the last reference
to it was removed in commit 45df8462730d ("afs: Fix server list handling")
so clean up the code by removing it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20221020173923.21342-1-colin.i.king@gmail.com/
18 months agoafs: Fix lost servers_outstanding count
David Howells [Wed, 21 Dec 2022 14:30:48 +0000 (14:30 +0000)]
afs: Fix lost servers_outstanding count

The afs_fs_probe_dispatcher() work function is passed a count on
net->servers_outstanding when it is scheduled (which may come via its
timer).  This is passed back to the work_item, passed to the timer or
dropped at the end of the dispatcher function.

But, at the top of the dispatcher function, there are two checks which
skip the rest of the function: if the network namespace is being destroyed
or if there are no fileservers to probe.  These two return paths, however,
do not drop the count passed to the dispatcher, and so, sometimes, the
destruction of a network namespace, such as induced by rmmod of the kafs
module, may get stuck in afs_purge_servers(), waiting for
net->servers_outstanding to become zero.

Fix this by adding the missing decrements in afs_fs_probe_dispatcher().

Fixes: f6cbb368bcb0 ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/167164544917.2072364.3759519569649459359.stgit@warthog.procyon.org.uk/
18 months agoMerge tag 'asoc-v6.2-3' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie...
Takashi Iwai [Thu, 22 Dec 2022 08:18:38 +0000 (09:18 +0100)]
Merge tag 'asoc-v6.2-3' of https://git./linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v6.2

Some more small fixes and board quirks that came in since my last
update, the main one being the fixes from Kai for issues around the
attempts to get kexec working well on SOF based systems.

18 months agoALSA: usb-audio: Add new quirk FIXED_RATE for JBL Quantum810 Wireless
Jaroslav Kysela [Thu, 15 Dec 2022 15:30:37 +0000 (16:30 +0100)]
ALSA: usb-audio: Add new quirk FIXED_RATE for JBL Quantum810 Wireless

It seems that the firmware is broken and does not accept
the UAC_EP_CS_ATTR_SAMPLE_RATE URB. There is only one rate (48000Hz)
available in the descriptors for the output endpoint.

Create a new quirk QUIRK_FLAG_FIXED_RATE to skip the rate setup
when only one rate is available (fixed).

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216798
Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20221215153037.1163786-1-perex@perex.cz
Signed-off-by: Takashi Iwai <tiwai@suse.de>
18 months agoALSA: azt3328: Remove the unused function snd_azf3328_codec_outl()
Jiapeng Chong [Tue, 13 Dec 2022 06:13:55 +0000 (14:13 +0800)]
ALSA: azt3328: Remove the unused function snd_azf3328_codec_outl()

The function snd_azf3328_codec_outl is defined in the azt3328.c file, but
not called elsewhere, so remove this unused function.

sound/pci/azt3328.c:367:1: warning: unused function 'snd_azf3328_codec_outl'.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3432
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221213061355.62856-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
18 months agoMerge branch 'for-next' into for-linus
Takashi Iwai [Thu, 22 Dec 2022 08:11:48 +0000 (09:11 +0100)]
Merge branch 'for-next' into for-linus

18 months agoMerge tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux...
Linus Torvalds [Thu, 22 Dec 2022 03:03:42 +0000 (19:03 -0800)]
Merge tag 'trace-v6.2-1' of git://git./linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "I missed this minor hardening of the kernel in the first pull.

   - Make monitor structures read only"

* tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv/monitors: Move monitor structure in rodata

18 months agoMerge tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Thu, 22 Dec 2022 02:57:24 +0000 (18:57 -0800)]
Merge tag 'trace-probes-v6.2' of git://git./linux/kernel/git/trace/linux-trace

Pull trace probes updates from Steven Rostedt:

 - New "symstr" type for dynamic events that writes the name of the
   function+offset into the ring buffer and not just the address

 - Prevent kernel symbol processing on addresses in user space probes
   (uprobes).

 - And minor fixes and clean ups

* tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Reject symbol/symstr type for uprobe
  tracing/probes: Add symstr type for dynamic events
  kprobes: kretprobe events missing on 2-core KVM guest
  kprobes: Fix check for probe enabled in kill_kprobe()
  test_kprobes: Fix implicit declaration error of test_kprobes
  tracing: Fix race where eprobes can be called before the event

18 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 22 Dec 2022 02:52:15 +0000 (18:52 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull RISC-V kvm updates from Paolo Bonzini:

 - Allow unloading KVM module

 - Allow KVM user-space to set mvendorid, marchid, and mimpid

 - Several fixes and cleanups

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  RISC-V: KVM: Add ONE_REG interface for mvendorid, marchid, and mimpid
  RISC-V: KVM: Save mvendorid, marchid, and mimpid when creating VCPU
  RISC-V: Export sbi_get_mvendorid() and friends
  RISC-V: KVM: Move sbi related struct and functions to kvm_vcpu_sbi.h
  RISC-V: KVM: Use switch-case in kvm_riscv_vcpu_set/get_reg()
  RISC-V: KVM: Remove redundant includes of asm/csr.h
  RISC-V: KVM: Remove redundant includes of asm/kvm_vcpu_timer.h
  RISC-V: KVM: Fix reg_val check in kvm_riscv_vcpu_set_reg_config()
  RISC-V: KVM: Simplify kvm_arch_prepare_memory_region()
  RISC-V: KVM: Exit run-loop immediately if xfer_to_guest fails
  RISC-V: KVM: use vma_lookup() instead of find_vma_intersection()
  RISC-V: KVM: Add exit logic to main.c