Michael S. Tsirkin [Thu, 7 Jun 2018 23:19:53 +0000 (02:19 +0300)]
kvm: fix typo in flag name
KVM_X86_DISABLE_EXITS_HTL really refers to exit on halt.
Obviously a typo: should be named KVM_X86_DISABLE_EXITS_HLT.
Fixes:
caa057a2cad ("KVM: X86: Provide a capability to disable HLT intercepts")
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 6 Jun 2018 15:38:09 +0000 (17:38 +0200)]
kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
The functions that were used in the emulation of fxrstor, fxsave, sgdt and
sidt were originally meant for task switching, and as such they did not
check privilege levels. This is very bad when the same functions are used
in the emulation of unprivileged instructions. This is CVE-2018-10853.
The obvious fix is to add a new argument to ops->read_std and ops->write_std,
which decides whether the access is a "system" access or should use the
processor's CPL.
Fixes:
129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 6 Jun 2018 15:37:49 +0000 (17:37 +0200)]
KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
Int the next patch the emulator's .read_std and .write_std callbacks will
grow another argument, which is not needed in kvm_read_guest_virt and
kvm_write_guest_virt_system's callers. Since we have to make separate
functions, let's give the currently existing names a nicer interface, too.
Fixes:
129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 6 Jun 2018 14:43:02 +0000 (16:43 +0200)]
KVM: x86: introduce linear_{read,write}_system
Wrap the common invocation of ctxt->ops->read_std and ctxt->ops->write_std, so
as to have a smaller patch when the functions grow another argument.
Fixes:
129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Felix Wilhelm [Mon, 11 Jun 2018 07:43:44 +0000 (09:43 +0200)]
kvm: nVMX: Enforce cpl=0 for VMX instructions
VMX instructions executed inside a L1 VM will always trigger a VM exit
even when executed with cpl 3. This means we must perform the
privilege check in software.
Fixes:
70f3aac964ae("kvm: nVMX: Remove superfluous VMX instruction fault checks")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jim Mattson [Tue, 29 May 2018 16:11:33 +0000 (09:11 -0700)]
kvm: nVMX: Add support for "VMWRITE to any supported field"
Add support for "VMWRITE to any supported field in the VMCS" and
enable this feature by default in L1's IA32_VMX_MISC MSR. If userspace
clears the VMX capability bit, the old behavior will be restored.
Note that this feature is a prerequisite for kvm in L1 to use VMCS
shadowing, once that feature is available.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jim Mattson [Tue, 29 May 2018 16:11:32 +0000 (09:11 -0700)]
kvm: nVMX: Restrict VMX capability MSR changes
Disallow changes to the VMX capability MSRs while the vCPU is in VMX
operation. Although this does break the existing API, it helps to
avoid some potentially tricky situations for which there is no
architected behavior.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 29 May 2018 06:53:17 +0000 (14:53 +0800)]
KVM: VMX: Optimize tscdeadline timer latency
'Commit
d0659d946be0 ("KVM: x86: add option to advance tscdeadline
hrtimer expiration")' advances the tscdeadline (the timer is emulated
by hrtimer) expiration in order that the latency which is incurred
by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch
adds the advance tscdeadline expiration support to which the tscdeadline
timer is emulated by VMX preemption timer to reduce the hypervisor
lantency (handle_preemption_timer -> vmentry). The guest can also
set an expiration that is very small (for example in Linux if an
hrtimer feeds a expiration in the past); in that case we set delta_tsc
to 0, leading to an immediately vmexit when delta_tsc is not bigger than
advance ns.
This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on
a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liran Alon [Sat, 26 May 2018 18:21:26 +0000 (21:21 +0300)]
KVM: docs: nVMX: Remove known limitations as they do not exist now
We can document other "Known Limitations" but the ones currently
referenced don't hold anymore...
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liran Alon [Sat, 26 May 2018 18:19:12 +0000 (21:19 +0300)]
KVM: docs: mmu: KVM support exposing SLAT to guests
Fix outdated statement that KVM is not able to expose SLAT
(Second-Layer-Address-Translation) to guests.
This was implemented a long time ago...
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Greg Kroah-Hartman [Tue, 29 May 2018 16:22:04 +0000 (18:22 +0200)]
kvm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
This cleans up the error handling a lot, as this code will never get
hit.
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmář" <rkrcmar@redhat.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Orr [Tue, 15 May 2018 11:37:37 +0000 (04:37 -0700)]
kvm: Make VM ioctl do valloc for some archs
The kvm struct has been bloating. For example, it's tens of kilo-bytes
for x86, which turns out to be a large amount of memory to allocate
contiguously via kzalloc. Thus, this patch does the following:
1. Uses architecture-specific routines to allocate the kvm struct via
vzalloc for x86.
2. Switches arm to __KVM_HAVE_ARCH_VM_ALLOC so that it can use vzalloc
when has_vhe() is true.
Other architectures continue to default to kalloc, as they have a
dependency on kalloc or have a small-enough struct kvm.
Signed-off-by: Marc Orr <marcorr@google.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Souptick Joarder [Wed, 18 Apr 2018 19:19:58 +0000 (00:49 +0530)]
kvm: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
commit
1c8f422059ae ("mm: change return type to vm_fault_t")
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 1 Jun 2018 17:17:41 +0000 (19:17 +0200)]
Merge tag 'kvm-s390-next-4.18-1' of git://git./linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Cleanups for 4.18
- cleanups for nested, clock handling, crypto, storage keys and
control register bits
Paolo Bonzini [Fri, 1 Jun 2018 17:17:22 +0000 (19:17 +0200)]
Merge tag 'kvmarm-for-v4.18' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM updates for 4.18
- Lazy context-switching of FPSIMD registers on arm64
- Allow virtual redistributors to be part of two or more MMIO ranges
Liran Alon [Fri, 25 May 2018 23:46:54 +0000 (02:46 +0300)]
KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Jim Mattson [Thu, 24 May 2018 18:59:54 +0000 (11:59 -0700)]
kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
Document the subtle nuances that KVM_CAP_X86_DISABLE_EXITS induces in
the KVM_GET_SUPPORTED_CPUID API.
Fixes:
4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:31 +0000 (17:21 +0200)]
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
We need a new capability to indicate support for the newly added
HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this
capability, userspace is supposed to announce PV TLB flush features
by setting the appropriate CPUID bits (if needed).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:30 +0000 (17:21 +0200)]
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
Implement HvFlushVirtualAddress{List,Space}Ex hypercalls in the same way
we've implemented non-EX counterparts.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
[Initialized valid_bank_mask to silence misguided GCC warnigs. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:29 +0000 (17:21 +0200)]
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:28 +0000 (17:21 +0200)]
KVM: introduce kvm_make_vcpus_request_mask() API
Hyper-V style PV TLB flush hypercalls inmplementation will use this API.
To avoid memory allocation in CONFIG_CPUMASK_OFFSTACK case add
cpumask_var_t argument.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:27 +0000 (17:21 +0200)]
KVM: x86: hyperv: do rep check for each hypercall separately
Prepare to support TLB flush hypercalls, some of which are REP hypercalls.
Also, return HV_STATUS_INVALID_HYPERCALL_INPUT as it seems more
appropriate.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:26 +0000 (17:21 +0200)]
KVM: x86: hyperv: use defines when parsing hypercall parameters
Avoid open-coding offsets for hypercall input parameters, we already
have defines for them.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vitaly Kuznetsov [Wed, 16 May 2018 15:21:24 +0000 (17:21 +0200)]
x86/hyper-v: move struct hv_flush_pcpu{,ex} definitions to common header
Hyper-V TLB flush hypercalls definitions will be required for KVM so move
them hyperv-tlfs.h. Structures also need to be renamed as '_pcpu' suffix is
irrelevant for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Radim Krčmář [Sat, 26 May 2018 11:45:49 +0000 (13:45 +0200)]
Merge branch 'x86/hyperv' of git://git./linux/kernel/git/tip/tip
To resolve conflicts with the PV TLB flush series.
Eric Auger [Tue, 22 May 2018 07:55:18 +0000 (09:55 +0200)]
KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to 512
Let's raise the number of supported vcpus along with
vgic v3 now that HW is looming with more physical CPUs.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:17 +0000 (09:55 +0200)]
KVM: arm/arm64: Implement KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION
Now all the internals are ready to handle multiple redistributor
regions, let's allow the userspace to register them.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:16 +0000 (09:55 +0200)]
KVM: arm/arm64: Add KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION
This new attribute allows the userspace to set the base address
of a reditributor region, relaxing the constraint of having all
consecutive redistibutor frames contiguous.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:15 +0000 (09:55 +0200)]
KVM: arm/arm64: Check all vcpu redistributors are set on map_resources
On vcpu first run, we eventually know the actual number of vcpus.
This is a synchronization point to check all redistributors
were assigned. On kvm_vgic_map_resources() we check both dist and
redist were set, eventually check potential base address inconsistencies.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:14 +0000 (09:55 +0200)]
KVM: arm/arm64: Check vcpu redist base before registering an iodev
As we are going to register several redist regions,
vgic_register_all_redist_iodevs() may be called several times. We need
to register a redist_iodev for a given vcpu only once. So let's
check if the base address has already been set. Initialize this latter
in kvm_vgic_vcpu_init().
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:13 +0000 (09:55 +0200)]
KVM: arm/arm64: Remove kvm_vgic_vcpu_early_init
kvm_vgic_vcpu_early_init gets called after kvm_vgic_cpu_init which
is confusing. The call path is as follows:
kvm_vm_ioctl_create_vcpu
|_ kvm_arch_cpu_create
|_ kvm_vcpu_init
|_ kvm_arch_vcpu_init
|_ kvm_vgic_vcpu_init
|_ kvm_arch_vcpu_postcreate
|_ kvm_vgic_vcpu_early_init
Static initialization currently done in kvm_vgic_vcpu_early_init()
can be moved to kvm_vgic_vcpu_init(). So let's move the code and
remove kvm_vgic_vcpu_early_init(). kvm_arch_vcpu_postcreate() does
nothing.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:12 +0000 (09:55 +0200)]
KVM: arm/arm64: Helper to register a new redistributor region
We introduce a new helper that creates and inserts a new redistributor
region into the rdist region list. This helper both handles the case
where the redistributor region size is known at registration time
and the legacy case where it is not (eventually depending on the number
of online vcpus). Depending on pfns, we perform all the possible checks
that we can do:
- end of memory crossing
- incorrect alignment of the base address
- collision with distributor region if already defined
- collision with already registered rdist regions
- check of the new index
Rdist regions must be inserted by increasing order of indices. Indices
must be contiguous.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:11 +0000 (09:55 +0200)]
KVM: arm/arm64: Adapt vgic_v3_check_base to multiple rdist regions
vgic_v3_check_base() currently only handles the case of a unique
legacy redistributor region whose size is not explicitly set but
inferred, instead, from the number of online vcpus.
We adapt it to handle the case of multiple redistributor regions
with explicitly defined size. We rely on two new helpers:
- vgic_v3_rdist_overlap() is used to detect overlap with the dist
region if defined
- vgic_v3_rd_region_size computes the size of the redist region,
would it be a legacy unique region or a new explicitly sized
region.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:10 +0000 (09:55 +0200)]
KVM: arm/arm64: Revisit Redistributor TYPER last bit computation
The TYPER of an redistributor reflects whether the rdist is
the last one of the redistributor region. Let's compare the TYPER
GPA against the address of the last occupied slot within the
redistributor region.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:09 +0000 (09:55 +0200)]
KVM: arm/arm64: Helper to locate free rdist index
We introduce vgic_v3_rdist_free_slot to help identifying
where we can place a new 2x64KB redistributor.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:08 +0000 (09:55 +0200)]
KVM: arm/arm64: Replace the single rdist region by a list
At the moment KVM supports a single rdist region. We want to
support several separate rdist regions so let's introduce a list
of them. This patch currently only cares about a single
entry in this list as the functionality to register several redist
regions is not yet there. So this only translates the existing code
into something functionally similar using that new data struct.
The redistributor region handle is stored in the vgic_cpu structure
to allow later computation of the TYPER last bit.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:07 +0000 (09:55 +0200)]
KVM: arm/arm64: Document KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION
We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
base address and size of a redistributor region
Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
to declare several separate redistributor regions.
So the whole redist space does not need to be contiguous anymore.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Eric Auger [Tue, 22 May 2018 07:55:06 +0000 (09:55 +0200)]
KVM: arm/arm64: Set dist->spis to NULL after kfree
in case kvm_vgic_map_resources() fails, typically if the vgic
distributor is not defined, __kvm_vgic_destroy will be called
several times. Indeed kvm_vgic_map_resources() is called on
first vcpu run. As a result dist->spis is freeed more than once
and on the second time it causes a "kernel BUG at mm/slub.c:3912!"
Set dist->spis to NULL to avoid the crash.
Fixes:
ad275b8bb1e6 ("KVM: arm/arm64: vgic-new: vgic_init: implement
vgic_init")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 2 May 2018 13:18:02 +0000 (14:18 +0100)]
KVM: arm64: Invoke FPSIMD context switch trap from C
The conversion of the FPSIMD context switch trap code to C has added
some overhead to calling it, due to the need to save registers that
the procedure call standard defines as caller-saved.
So, perhaps it is no longer worth invoking this trap handler quite
so early.
Instead, we can invoke it from fixup_guest_exit(), with little
likelihood of increasing the overhead much further.
As a convenience, this patch gives __hyp_switch_fpsimd() the same
return semantics fixup_guest_exit(). For now there is no
possibility of a spurious FPSIMD trap, so the function always
returns true, but this allows it to be tail-called with a single
return statement.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 2 May 2018 12:36:48 +0000 (13:36 +0100)]
KVM: arm64: Fold redundant exit code checks out of fixup_guest_exit()
The entire tail of fixup_guest_exit() is contained in if statements
of the form if (x && *exit_code == ARM_EXCEPTION_TRAP). As a result,
we can check just once and bail out of the function early, allowing
the remaining if conditions to be simplified.
The only awkward case is where *exit_code is changed to
ARM_EXCEPTION_EL1_SERROR in the case of an illegal GICv2 CPU
interface access: in that case, the GICv3 trap handling code is
skipped using a goto. This avoids pointlessly evaluating the
static branch check for the GICv3 case, even though we can't have
vgic_v2_cpuif_trap and vgic_v3_cpuif_trap true simultaneously
unless we have a GICv3 and GICv2 on the host: that sounds stupid,
but I haven't satisfied myself that it can't happen.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 2 May 2018 12:23:07 +0000 (13:23 +0100)]
KVM: arm64: Remove redundant *exit_code changes in fpsimd_guest_exit()
In fixup_guest_exit(), there are a couple of cases where after
checking what the exit code was, we assign it explicitly with the
value it already had.
Assuming this is not indicative of a bug, these assignments are not
needed.
This patch removes the redundant assignments, and simplifies some
if-nesting that becomes trivial as a result.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Fri, 20 Apr 2018 16:39:16 +0000 (17:39 +0100)]
KVM: arm64: Remove eager host SVE state saving
Now that the host SVE context can be saved on demand from Hyp,
there is no longer any need to save this state in advance before
entering the guest.
This patch removes the relevant call to
kvm_fpsimd_flush_cpu_state().
Since the problem that function was intended to solve now no longer
exists, the function and its dependencies are also deleted.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Fri, 20 Apr 2018 15:20:43 +0000 (16:20 +0100)]
KVM: arm64: Save host SVE context as appropriate
This patch adds SVE context saving to the hyp FPSIMD context switch
path. This means that it is no longer necessary to save the host
SVE state in advance of entering the guest, when in use.
In order to avoid adding pointless complexity to the code, VHE is
assumed if SVE is in use. VHE is an architectural prerequisite for
SVE, so there is no good reason to turn CONFIG_ARM64_VHE off in
kernels that support both SVE and KVM.
Historically, software models exist that can expose the
architecturally invalid configuration of SVE without VHE, so if
this situation is detected at kvm_init() time then KVM will be
disabled.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Thu, 12 Apr 2018 16:32:35 +0000 (17:32 +0100)]
arm64/sve: Move sve_pffr() to fpsimd.h and make inline
In order to make sve_save_state()/sve_load_state() more easily
reusable and to get rid of a potential branch on context switch
critical paths, this patch makes sve_pffr() inline and moves it to
fpsimd.h.
<asm/processor.h> must be included in fpsimd.h in order to make
this work, and this creates an #include cycle that is tricky to
avoid without modifying core code, due to the way the PR_SVE_*()
prctl helpers are included in the core prctl implementation.
Instead of breaking the cycle, this patch defers inclusion of
<asm/fpsimd.h> in <asm/processor.h> until the point where it is
actually needed: i.e., immediately before the prctl definitions.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Thu, 12 Apr 2018 16:04:39 +0000 (17:04 +0100)]
arm64/sve: Switch sve_pffr() argument from task to thread
sve_pffr(), which is used to derive the base address used for
low-level SVE save/restore routines, currently takes the relevant
task_struct as an argument.
The only accessed fields are actually part of thread_struct, so
this patch changes the argument type accordingly. This is done in
preparation for moving this function to a header, where we do not
want to have to include <linux/sched.h> due to the consequent
circular #include problems.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Thu, 12 Apr 2018 15:47:20 +0000 (16:47 +0100)]
arm64/sve: Move read_zcr_features() out of cpufeature.h
Having read_zcr_features() inline in cpufeature.h results in that
header requiring #includes which make it hard to include
<asm/fpsimd.h> elsewhere without triggering header inclusion
cycles.
This is not a hot-path function and arguably should not be in
cpufeature.h in the first place, so this patch moves it to
fpsimd.c, compiled conditionally if CONFIG_ARM64_SVE=y.
This allows some SVE-related #includes to be dropped from
cpufeature.h, which will ease future maintenance.
A couple of missing #includes of <asm/fpsimd.h> are exposed by this
change under arch/arm64/. This patch adds the missing #includes as
necessary.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Fri, 6 Apr 2018 13:55:59 +0000 (14:55 +0100)]
KVM: arm64: Optimise FPSIMD handling to reduce guest/host thrashing
This patch refactors KVM to align the host and guest FPSIMD
save/restore logic with each other for arm64. This reduces the
number of redundant save/restore operations that must occur, and
reduces the common-case IRQ blackout time during guest exit storms
by saving the host state lazily and optimising away the need to
restore the host state before returning to the run loop.
Four hooks are defined in order to enable this:
* kvm_arch_vcpu_run_map_fp():
Called on PID change to map necessary bits of current to Hyp.
* kvm_arch_vcpu_load_fp():
Set up FP/SIMD for entering the KVM run loop (parse as
"vcpu_load fp").
* kvm_arch_vcpu_ctxsync_fp():
Get FP/SIMD into a safe state for re-enabling interrupts after a
guest exit back to the run loop.
For arm64 specifically, this involves updating the host kernel's
FPSIMD context tracking metadata so that kernel-mode NEON use
will cause the vcpu's FPSIMD state to be saved back correctly
into the vcpu struct. This must be done before re-enabling
interrupts because kernel-mode NEON may be used by softirqs.
* kvm_arch_vcpu_put_fp():
Save guest FP/SIMD state back to memory and dissociate from the
CPU ("vcpu_put fp").
Also, the arm64 FPSIMD context switch code is updated to enable it
to save back FPSIMD state for a vcpu, not just current. A few
helpers drive this:
* fpsimd_bind_state_to_cpu(struct user_fpsimd_state *fp):
mark this CPU as having context fp (which may belong to a vcpu)
currently loaded in its registers. This is the non-task
equivalent of the static function fpsimd_bind_to_cpu() in
fpsimd.c.
* task_fpsimd_save():
exported to allow KVM to save the guest's FPSIMD state back to
memory on exit from the run loop.
* fpsimd_flush_state():
invalidate any context's FPSIMD state that is currently loaded.
Used to disassociate the vcpu from the CPU regs on run loop exit.
These changes allow the run loop to enable interrupts (and thus
softirqs that may use kernel-mode NEON) without having to save the
guest's FPSIMD state eagerly.
Some new vcpu_arch fields are added to make all this work. Because
host FPSIMD state can now be saved back directly into current's
thread_struct as appropriate, host_cpu_context is no longer used
for preserving the FPSIMD state. However, it is still needed for
preserving other things such as the host's system registers. To
avoid ABI churn, the redundant storage space in host_cpu_context is
not removed for now.
arch/arm is not addressed by this patch and continues to use its
current save/restore logic. It could provide implementations of
the helpers later if desired.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Tue, 8 May 2018 13:47:23 +0000 (14:47 +0100)]
KVM: arm64: Repurpose vcpu_arch.debug_flags for general-purpose flags
In struct vcpu_arch, the debug_flags field is used to store
debug-related flags about the vcpu state.
Since we are about to add some more flags related to FPSIMD and
SVE, it makes sense to add them to the existing flags field rather
than adding new fields. Since there is only one debug_flags flag
defined so far, there is plenty of free space for expansion.
In preparation for adding more flags, this patch renames the
debug_flags field to simply "flags", and updates comments
appropriately.
The flag definitions are also moved to <asm/kvm_host.h>, since
their presence in <asm/kvm_asm.h> was for purely historical
reasons: these definitions are not used from asm any more, and not
very likely to be as more Hyp asm is migrated to C.
KVM_ARM64_DEBUG_DIRTY_SHIFT has not been used since commit
1ea66d27e7b0 ("arm64: KVM: Move away from the assembly version of
the world switch"), so this patch gets rid of that too.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: fixed minor conflict]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 9 May 2018 13:27:41 +0000 (14:27 +0100)]
arm64/sve: Refactor user SVE trap maintenance for external use
In preparation for optimising the way KVM manages switching the
guest and host FPSIMD state, it is necessary to provide a means for
code outside arch/arm64/kernel/fpsimd.c to restore the user trap
configuration for SVE correctly for the current task.
Rather than requiring external code to duplicate the maintenance
explicitly, this patch moves the trap maintenenace to
fpsimd_bind_to_cpu(), since it is logically part of the work of
associating the current task with the cpu.
Because fpsimd_bind_to_cpu() is rather a cryptic name to publish
alongside fpsimd_bind_state_to_cpu(), the former function is
renamed to fpsimd_bind_task_to_cpu() to make its purpose more
explicit.
This patch makes appropriate changes to ensure that
fpsimd_bind_task_to_cpu() is always called alongside
task_fpsimd_load(), so that the trap maintenance continues to be
done in every situation where it was done prior to this patch.
As a side-effect, the metadata updates done by
fpsimd_bind_task_to_cpu() now change from conditional to
unconditional in the "already bound" case of sigreturn. This is
harmless, and a couple of extra stores on this slow path will not
impact performance. I consider this a reasonable price to pay for
a slightly cleaner interface.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Mon, 21 May 2018 18:08:15 +0000 (19:08 +0100)]
arm64: fpsimd: Eliminate task->mm checks
Currently the FPSIMD handling code uses the condition task->mm ==
NULL as a hint that task has no FPSIMD register context.
The ->mm check is only there to filter out tasks that cannot
possibly have FPSIMD context loaded, for optimisation purposes.
Also, TIF_FOREIGN_FPSTATE must always be checked anyway before
saving FPSIMD context back to memory. For these reasons, the ->mm
checks are not useful, providing that TIF_FOREIGN_FPSTATE is
maintained in a consistent way for all threads.
The context switch logic is already deliberately optimised to defer
reloads of the regs until ret_to_user (or sigreturn as a special
case), and save them only if they have been previously loaded.
These paths are the only places where the wrong_task and wrong_cpu
conditions can be made false, by calling fpsimd_bind_task_to_cpu().
Kernel threads by definition never reach these paths. As a result,
the wrong_task and wrong_cpu tests in fpsimd_thread_switch() will
always yield true for kernel threads.
This patch removes the redundant checks and special-case code,
ensuring that TIF_FOREIGN_FPSTATE is set whenever a kernel thread
is scheduled in, and ensures that this flag is set for the init
task. The fpsimd_flush_task_state() call already present in
copy_thread() ensures the same for any new task.
With TIF_FOREIGN_FPSTATE always set for kernel threads, this patch
ensures that no extra context save work is added for kernel
threads, and eliminates the redundant context saving that may
currently occur for kernel threads that have acquired an mm via
use_mm().
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Thu, 24 May 2018 14:54:30 +0000 (15:54 +0100)]
arm64: fpsimd: Avoid FPSIMD context leakage for the init task
The init task is started with thread_flags equal to 0, which means
that TIF_FOREIGN_FPSTATE is initially clear.
It is theoretically possible (if unlikely) that the init task could
reach userspace without ever being scheduled out. If this occurs,
data left in the FPSIMD registers by the kernel could be exposed.
This patch fixes this anomaly by ensuring that the init task's
initial TIF_FOREIGN_FPSTATE is set.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Fixes:
005f78cd8849 ("arm64: defer reloading a task's FPSIMD state to userland resume")
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Fri, 6 Apr 2018 13:55:59 +0000 (14:55 +0100)]
arm64: fpsimd: Generalise context saving for non-task contexts
In preparation for allowing non-task (i.e., KVM vcpu) FPSIMD
contexts to be handled by the fpsimd common code, this patch adapts
task_fpsimd_save() to save back the currently loaded context,
removing the explicit dependency on current.
The relevant storage to write back to in memory is now found by
examining the fpsimd_last_state percpu struct.
fpsimd_save() does nothing unless TIF_FOREIGN_FPSTATE is clear, and
fpsimd_last_state is updated under local_bh_disable() or
local_irq_disable() everywhere that TIF_FOREIGN_FPSTATE is cleared:
thus, fpsimd_save() will write back to the correct storage for the
loaded context.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Fri, 16 Feb 2018 16:35:32 +0000 (16:35 +0000)]
KVM: arm64: Convert lazy FPSIMD context switch trap to C
To make the lazy FPSIMD context switch trap code easier to hack on,
this patch converts it to C.
This is not amazingly efficient, but the trap should typically only
be taken once per host context switch.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Christoffer Dall [Fri, 23 Feb 2018 16:23:57 +0000 (17:23 +0100)]
KVM: arm/arm64: Introduce kvm_arch_vcpu_run_pid_change
KVM/ARM differs from other architectures in having to maintain an
additional virtual address space from that of the host and the
guest, because we split the execution of KVM across both EL1 and
EL2.
This results in a need to explicitly map data structures into EL2
(hyp) which are accessed from the hyp code. As we are about to be
more clever with our FPSIMD handling on arm64, which stores data in
the task struct and uses thread_info flags, we will have to map
parts of the currently executing task struct into the EL2 virtual
address space.
However, we don't want to do this on every KVM_RUN, because it is a
fairly expensive operation to walk the page tables, and the common
execution mode is to map a single thread to a VCPU. By introducing
a hook that architectures can select with
HAVE_KVM_VCPU_RUN_PID_CHANGE, we do not introduce overhead for
other architectures, but have a simple way to only map the data we
need when required for arm64.
This patch introduces the framework only, and wires it up in the
arm/arm64 KVM common code.
No functional change.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 11 Apr 2018 16:59:06 +0000 (17:59 +0100)]
arm64: Use update{,_tsk}_thread_flag()
This patch uses the new update_thread_flag() helpers to simplify a
couple of if () set; else clear; constructs.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Wed, 11 Apr 2018 16:54:20 +0000 (17:54 +0100)]
thread_info: Add update_thread_flag() helpers
There are a number of bits of code sprinkled around the kernel to
set a thread flag if a certain condition is true, and clear it
otherwise.
To help make those call sites terser and less cumbersome, this
patch adds a new family of thread flag manipulators
update*_thread_flag([...,] flag, cond)
which do the equivalent of:
if (cond)
set*_thread_flag([...,] flag);
else
clear*_thread_flag([...,] flag);
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Dave Martin [Mon, 21 May 2018 17:25:43 +0000 (18:25 +0100)]
arm64: fpsimd: Fix TIF_FOREIGN_FPSTATE after invalidating cpu regs
fpsimd_last_state.st is set to NULL as a way of indicating that
current's FPSIMD registers are no longer loaded in the cpu. In
particular, this is done when the kernel temporarily uses or
clobbers the FPSIMD registers for its own purposes, as in CPU PM or
kernel-mode NEON, resulting in them being populated with garbage
data not belonging to a task.
Commit
17eed27b02da ("arm64/sve: KVM: Prevent guests from using
SVE") factors this operation out as a new helper
fpsimd_flush_cpu_state() to make it clearer what is being done
here, and on SVE systems this helper is now used, via
kvm_fpsimd_flush_cpu_state(), to invalidate the registers after KVM
has run a vcpu. The reason for this is that KVM does not yet
understand how to restore the full host SVE registers itself after
loading the guest FPSIMD context into them.
This exposes a particular problem: if fpsimd_last_state.st is set
to NULL without also setting TIF_FOREIGN_FPSTATE, the kernel may
continue to think that current's FPSIMD registers are live even
though they have actually been clobbered.
Prior to the aforementioned commit, the only path where
fpsimd_last_state.st is set to NULL without setting
TIF_FOREIGN_FPSTATE is when kernel_neon_begin() is called by a
kernel thread (where current->mm can be NULL). This does not
matter, because the only harm is that at context-switch time
fpsimd_thread_switch() may unnecessarily save the FPSIMD registers
back to current's thread_struct (even though kernel threads are not
considered to have any FPSIMD context of their own and the
registers will never be reloaded).
Note that although CPU_PM_ENTER lacks the TIF_FOREIGN_FPSTATE
setting, every CPU passing through that path must subsequently pass
through CPU_PM_EXIT before it can re-enter the kernel proper.
CPU_PM_EXIT sets the flag.
The sve_flush_cpu_state() function added by commit
17eed27b02da
also lacks the proper maintenance of TIF_FOREIGN_FPSTATE. This may
cause the bits of a host task's SVE registers that do not alias the
FPSIMD register file to spontaneously appear zeroed if a KVM vcpu
runs in the same task in the meantime. Although this effect is
hidden by the fact that the non-FPSIMD bits of the SVE registers
are zeroed by a syscall anyway, it is doubtless a bad idea to rely
on these different code paths interacting correctly under future
maintenance.
This patch makes TIF_FOREIGN_FPSTATE an unconditional side-effect
of fpsimd_flush_cpu_state(), and removes the set_thread_flag()
calls that become redundant as a result. This ensures that
TIF_FOREIGN_FPSTATE cannot remain clear if the FPSIMD state in the
FPSIMD registers is invalid.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Jingqi Liu [Tue, 22 May 2018 09:01:27 +0000 (17:01 +0800)]
KVM: x86: Expose CLDEMOTE CPU feature to guest VM
The CLDEMOTE instruction hints to hardware that the cache line that
contains the linear address should be moved("demoted") from
the cache(s) closest to the processor core to a level more distant
from the processor core. This may accelerate subsequent accesses
to the line by other cores in the same coherence domain,
especially if the line was written by the core that demotes the line.
This patch exposes the cldemote feature to the guest.
The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf
This patch has a dependency on https://lkml.org/lkml/2018/4/23/928
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Liran Alon [Tue, 22 May 2018 14:16:15 +0000 (17:16 +0300)]
KVM: nVMX: Emulate L1 individual-address invvpid by L0 individual-address invvpid
When vmcs12 uses VPID, all TLB entries populated by L2 are tagged with
vmx->nested.vpid02. Currently, INVVPID executed by L1 is emulated by L0
by using INVVPID single/global-context to flush all TLB entries
tagged with vmx->nested.vpid02 regardless of INVVPID type executed by
L1.
However, we can easily optimize the case of L1 INVVPID on an
individual-address. Just INVVPID given individual-address tagged with
vmx->nested.vpid02.
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[Squashed with a preparatory patch that added the !operand.vpid line.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Liran Alon [Tue, 22 May 2018 14:16:14 +0000 (17:16 +0300)]
KVM: nVMX: Don't flush TLB when vmcs12 uses VPID
Since commit
5c614b3583e7 ("KVM: nVMX: nested VPID emulation"),
vmcs01 and vmcs02 don't share the same VPID. vmcs01 uses vmx->vpid
while vmcs02 uses vmx->nested.vpid02. This was done such that TLB
flush could be avoided when switching between L1 and L2.
However, the above mentioned commit only changed L2 VMEntry logic to
not flush TLB when switching from L1 to L2. It forgot to also remove
the TLB flush which is done when simulating a VMExit from L2 to L1.
To fix this issue, on VMExit from L2 to L1 we flush TLB only in case
vmcs01 enables VPID and vmcs01->vpid==vmcs02->vpid. This happens when
vmcs01 enables VPID and vmcs12 does not.
Fixes:
5c614b3583e7 ("KVM: nVMX: nested VPID emulation")
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Liran Alon [Tue, 22 May 2018 14:16:12 +0000 (17:16 +0300)]
KVM: nVMX: Use vmx local var for referencing vpid02
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Dan Carpenter [Sat, 19 May 2018 06:01:36 +0000 (09:01 +0300)]
KVM: x86: prevent integer overflows in KVM_MEMORY_ENCRYPT_REG_REGION
This is a fix from reviewing the code, but it looks like it might be
able to lead to an Oops. It affects 32bit systems.
The KVM_MEMORY_ENCRYPT_REG_REGION ioctl uses a u64 for range->addr and
range->size but the high 32 bits would be truncated away on a 32 bit
system. This is harmless but it's also harmless to prevent it.
Then in sev_pin_memory() the "uaddr + ulen" calculation can wrap around.
The wrap around can happen on 32 bit or 64 bit systems, but I was only
able to figure out a problem for 32 bit systems. We would pick a number
which results in "npages" being zero. The sev_pin_memory() would then
return ZERO_SIZE_PTR without allocating anything.
I made it illegal to call sev_pin_memory() with "ulen" set to zero.
Hopefully, that doesn't cause any problems. I also changed the type of
"first" and "last" to long, just for cosmetic reasons. Otherwise on a
64 bit system you're saving "uaddr >> 12" in an int and it truncates the
high 20 bits away. The math works in the current code so far as I can
see but it's just weird.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[Brijesh noted that the code is only reachable on X86_64.]
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Sean Christopherson [Thu, 29 Mar 2018 22:04:01 +0000 (15:04 -0700)]
KVM: x86: remove obsolete EXPORT... of handle_mmio_page_fault
handle_mmio_page_fault() was recently moved to be an internal-only
MMU function, i.e. it's static and no longer defined in kvm_host.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Jim Mattson [Tue, 1 May 2018 22:40:28 +0000 (15:40 -0700)]
KVM: nVMX: Ensure that VMCS12 field offsets do not change
Enforce the invariant that existing VMCS12 field offsets must not
change. Experience has shown that without strict enforcement, this
invariant will not be maintained.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Changed the code to use BUILD_BUG_ON_MSG instead of better, but GCC 4.6
requiring _Static_assert. - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Jim Mattson [Tue, 1 May 2018 22:40:27 +0000 (15:40 -0700)]
KVM: nVMX: Restore the VMCS12 offsets for v4.0 fields
Changing the VMCS12 layout will break save/restore compatibility with
older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
accepted upstream. Google has already been using these ioctls for some
time, and we implore the community not to disturb the existing layout.
Move the four most recently added fields to preserve the offsets of
the previously defined fields and reserve locations for the vmread and
vmwrite bitmaps, which will be used in the virtualization of VMCS
shadowing (to improve the performance of double-nesting).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Kept the SDM order in vmcs_field_to_offset_table. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Arnd Bergmann [Mon, 23 Apr 2018 08:04:26 +0000 (10:04 +0200)]
KVM: x86: use timespec64 for KVM_HC_CLOCK_PAIRING
The hypercall was added using a struct timespec based implementation,
but we should not use timespec in new code.
This changes it to timespec64. There is no functional change
here since the implementation is only used in 64-bit kernels
that use the same definition for timespec and timespec64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Jim Mattson [Thu, 26 Apr 2018 23:09:12 +0000 (16:09 -0700)]
kvm: nVMX: Use nested_run_pending rather than from_vmentry
When saving a vCPU's nested state, the vmcs02 is discarded. Only the
shadow vmcs12 is saved. The shadow vmcs12 contains all of the
information needed to reconstruct an equivalent vmcs02 on restore, but
we have to be able to deal with two contexts:
1. The nested state was saved immediately after an emulated VM-entry,
before the vmcs02 was ever launched.
2. The nested state was saved some time after the first successful
launch of the vmcs02.
Though it's an implementation detail rather than an architected bit,
vmx->nested_run_pending serves to distinguish between these two
cases. Hence, we save it as part of the vCPU's nested state. (Yes,
this is ugly.)
Even when restoring from a checkpoint, it may be necessary to build
the vmcs02 as if prepare_vmcs02 was called from nested_vmx_run. So,
the 'from_vmentry' argument should be dropped, and
vmx->nested_run_pending should be consulted instead. The nested state
restoration code then has to set vmx->nested_run_pending prior to
calling prepare_vmcs02. It's important that the restoration code set
vmx->nested_run_pending anyway, since the flag impacts things like
interrupt delivery as well.
Fixes:
cf8b84f48a59 ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Mark Rutland [Thu, 10 May 2018 11:13:47 +0000 (12:13 +0100)]
arm64: KVM: Use lm_alias() for kvm_ksym_ref()
For historical reasons, we open-code lm_alias() in kvm_ksym_ref().
Let's use lm_alias() to avoid duplication and make things clearer.
As we have to pull this from <linux/mm.h> (which is not safe for
inclusion in assembly), we may as well move the kvm_ksym_ref()
definition into the existing !__ASSEMBLY__ block.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Thomas Gleixner [Sat, 19 May 2018 19:22:48 +0000 (21:22 +0200)]
x86/Hyper-V/hv_apic: Build the Hyper-V APIC conditionally
The Hyper-V APIC code is built when CONFIG_HYPERV is enabled but the actual
code in that file is guarded with CONFIG_X86_64. There is no point in doing
this. Neither is there a point in having the CONFIG_HYPERV guard in there
because the containing directory is not built when CONFIG_HYPERV=n.
Further for the hv_init_apic() function a stub is provided only for
CONFIG_HYPERV=n, which is pointless as the callsite is not compiled at
all. But for X86_32 the stub is missing and the build fails.
Clean that up:
- Compile hv_apic.c only when CONFIG_X86_64=y
- Make the stub for hv_init_apic() available when CONFG_X86_64=n
Fixes:
6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Thomas Gleixner [Sat, 19 May 2018 14:38:59 +0000 (16:38 +0200)]
x86/Hyper-V/hv_apic: Include asm/apic.h
Not all configurations magically include asm/apic.h, but the Hyper-V code
requires it. Include it explicitely.
Fixes:
6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
K. Y. Srinivasan [Wed, 16 May 2018 21:53:34 +0000 (14:53 -0700)]
X86/Hyper-V: Consolidate the allocation of the hypercall input page
Consolidate the allocation of the hypercall input page.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-5-kys@linuxonhyperv.com
K. Y. Srinivasan [Wed, 16 May 2018 21:53:33 +0000 (14:53 -0700)]
X86/Hyper-V: Consolidate code for converting cpumask to vpset
Consolidate code for converting cpumask to vpset.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-4-kys@linuxonhyperv.com
K. Y. Srinivasan [Wed, 16 May 2018 21:53:32 +0000 (14:53 -0700)]
X86/Hyper-V: Enhanced IPI enlightenment
Support enhanced IPI enlightenments (to target more than 64 CPUs).
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-3-kys@linuxonhyperv.com
K. Y. Srinivasan [Wed, 16 May 2018 21:53:31 +0000 (14:53 -0700)]
X86/Hyper-V: Enable IPI enlightenments
Hyper-V supports hypercalls to implement IPI; use them.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-2-kys@linuxonhyperv.com
K. Y. Srinivasan [Wed, 16 May 2018 21:53:30 +0000 (14:53 -0700)]
X86/Hyper-V: Enlighten APIC access
Hyper-V supports MSR based APIC access; implement
the enlightenment.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-1-kys@linuxonhyperv.com
Linus Torvalds [Sat, 19 May 2018 04:24:26 +0000 (21:24 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hfsplus: stop workqueue when fill_super() failed
mm: don't allow deferred pages with NEED_PER_CPU_KM
MAINTAINERS: add Q: entry to kselftest for patchwork project
radix tree: fix multi-order iteration race
radix tree test suite: multi-order iteration race
radix tree test suite: add item_delete_rcu()
radix tree test suite: fix compilation issue
radix tree test suite: fix mapshift build target
include/linux/mm.h: add new inline function vmf_error()
lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
Linus Torvalds [Sat, 19 May 2018 04:22:16 +0000 (21:22 -0700)]
Merge tag 'platform-drivers-x86-v4.17-3' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fix from Darren Hart:
"Remove the last of the "select DELL_SMBIOS" references in the Kconfig"
* tag 'platform-drivers-x86-v4.17-3' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: DELL_WMI use depends on instead of select for DELL_SMBIOS
Linus Torvalds [Sat, 19 May 2018 04:19:02 +0000 (21:19 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
- a modified revert of a patch that made new choices come out for a
couple stm32 clk drivers that really always need to be there when
that particular machine is compiled in
- boot fix on i.MX for Stefan who noticed odd behavior from the
critical flag patch that came in during the merge window
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: stm32: fix: stm32 clock drivers are not compiled by default
clk: imx6ull: use OSC clock during AXI rate change
Linus Torvalds [Sat, 19 May 2018 01:02:01 +0000 (18:02 -0700)]
Merge branch 'i2c/for-current-fixed' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A bunch of driver bugfixes and a MAINTAINERS addition"
* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: add entry for STM32 I2C driver
i2c: viperboard: return message count on master_xfer success
i2c: pmcmsp: fix error return from master_xfer
i2c: pmcmsp: return message count on master_xfer success
i2c: designware: fix poll-after-enable regression
eeprom: at24: fix retrieving the at24_chip_data structure
i2c: core: ACPI: Log device not acking errors at dbg loglevel
i2c: core: ACPI: Improve OpRegion read errors
Tetsuo Handa [Fri, 18 May 2018 23:09:16 +0000 (16:09 -0700)]
hfsplus: stop workqueue when fill_super() failed
syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().
As far as I can see, it is hfsplus_mark_mdb_dirty() from
hfsplus_new_inode() in hfsplus_fill_super() that calls
queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
not fail if queue_delayed_work() was called, and the out_put_hidden_dir
label is the appropriate location to call cancel_delayed_work_sync().
[1] https://syzkaller.appspot.com/bug?id=
a66f45e96fdbeb76b796bf46eb25ea878c42a6c9
Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4f2e5f086147d543ab03@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Fri, 18 May 2018 23:09:13 +0000 (16:09 -0700)]
mm: don't allow deferred pages with NEED_PER_CPU_KM
It is unsafe to do virtual to physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
we initialize struct pages for all the allocated memory when deferred
struct pages are used.
My recent fix in commit
c9e97a1997 ("mm: initialize pages on demand
during boot") exposed this problem, because it greatly reduced number of
pages that are initialized before mm_init(), but the problem existed
even before my fix, as Fengguang Wu found.
Below is a more detailed explanation of the problem.
We initialize struct pages in four places:
1. Early in boot a small set of struct pages is initialized to fill the
first section, and lower zones.
2. During mm_init() we initialize "struct pages" for all the memory that
is allocated, i.e reserved in memblock.
3. Using on-demand logic when pages are allocated after mm_init call
(when memblock is finished)
4. After smp_init() when the rest free deferred pages are initialized.
The problem occurs if we try to do va to phys translation of a memory
between steps 1 and 2. Because we have not yet initialized struct pages
for all the reserved pages, it is inherently unsafe to do va to phys if
the translation itself requires access of "struct page" as in case of
this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP
The following path exposes the problem:
start_kernel()
trap_init()
setup_cpu_entry_areas()
setup_cpu_entry_area(cpu)
get_cpu_gdt_paddr(cpu)
per_cpu_ptr_to_phys(addr)
pcpu_addr_to_page(addr)
virt_to_page(addr)
pfn_to_page(__pa(addr) >> PAGE_SHIFT)
We disable this path by not allowing NEED_PER_CPU_KM with deferred
struct pages feature.
The problems are discussed in these threads:
http://lkml.kernel.org/r/
20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/
20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/
20180426202619.2768-1-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
Fixes:
3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shuah Khan (Samsung OSG) [Fri, 18 May 2018 23:09:09 +0000 (16:09 -0700)]
MAINTAINERS: add Q: entry to kselftest for patchwork project
A new patchwork project is created to track kselftest patches. Update
the kselftest entry in the MAINTAINERS file adding 'Q:' entry:
https://patchwork.kernel.org/project/linux-kselftest/list/
Link: http://lkml.kernel.org/r/20180515164427.12201-1-shuah@kernel.org
Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 18 May 2018 23:09:06 +0000 (16:09 -0700)]
radix tree: fix multi-order iteration race
Fix a race in the multi-order iteration code which causes the kernel to
hit a GP fault. This was first seen with a production v4.15 based
kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
order 9 PMD DAX entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember for example that
an order 2 entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
We fix this race by fixing the way that skip_siblings() detects sibling
nodes. Instead of testing against the preceding slot we instead look
for siblings via is_sibling_entry() which compares against the position
of the struct radix_tree_node.slots[] array. This ensures that sibling
entries are properly identified, even if they are no longer contiguous
with the 'entry' they point to.
Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
Fixes:
148deab223b2 ("radix-tree: improve multiorder iterators")
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: CR, Sapthagirish <sapthagirish.cr@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 18 May 2018 23:09:01 +0000 (16:09 -0700)]
radix tree test suite: multi-order iteration race
Add a test which shows a race in the multi-order iteration code. This
test reliably hits the race in under a second on my machine, and is the
result of a real bug report against kernel a production v4.15 based
kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit
when using order 9 PMD DAX radix tree entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember that an order 2
entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
In the radix tree test suite this will be caught by the address
sanitizer:
==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
#0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
#1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
#2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
#3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
#4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 18 May 2018 23:08:58 +0000 (16:08 -0700)]
radix tree test suite: add item_delete_rcu()
Currently the lifetime of "struct item" entries in the radix tree are
not controlled by RCU, but are instead deleted inline as they are
removed from the tree.
In the following patches we add a test which has threads iterating over
items pulled from the tree and verifying them in an
rcu_read_lock()/rcu_read_unlock() section. This means that though an
item has been removed from the tree it could still be being worked on by
other threads until the RCU grace period expires. So, we need to
actually free the "struct item" structures at the end of the grace
period, just as we do with "struct radix_tree_node" items.
Link: http://lkml.kernel.org/r/20180503192430.7582-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 18 May 2018 23:08:54 +0000 (16:08 -0700)]
radix tree test suite: fix compilation issue
Pulled from a patch from Matthew Wilcox entitled "xarray: Add definition
of struct xarray":
> From: Matthew Wilcox <mawilcox@microsoft.com>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
https://patchwork.kernel.org/patch/
10341249/
These defines fix this compilation error:
In file included from ./linux/radix-tree.h:6:0,
from ./linux/../../../../include/linux/idr.h:15,
from ./linux/idr.h:1,
from idr.c:4:
./linux/../../../../include/linux/idr.h: In function `idr_init_base':
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function `spin_lock_init'; did you mean `spinlock_t'? [-Wimplicit-function-declaration]
spin_lock_init(&(root)->xa_lock); \
^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro `INIT_RADIX_TREE'
INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
^~~~~~~~~~~~~~~
by providing a spin_lock_init() wrapper for the v4.17-rc* version of the
radix tree test suite.
Link: http://lkml.kernel.org/r/20180503192430.7582-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Fri, 18 May 2018 23:08:51 +0000 (16:08 -0700)]
radix tree test suite: fix mapshift build target
Commit
c6ce3e2fe3da ("radix tree test suite: Add config option for map
shift") introduced a phony makefile target called 'mapshift' that ends
up generating the file generated/map-shift.h. This phony target was
then added as a dependency of the top level 'targets' build target,
which is what is run when you go to tools/testing/radix-tree and just
type 'make'.
Unfortunately, this phony target doesn't actually work as a dependency,
so you end up getting:
$ make
make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. Stop.
make: *** Waiting for unfinished jobs....
Fix this by making the file generated/map-shift.h our real makefile
target, and add this a dependency of the top level build target.
Link: http://lkml.kernel.org/r/20180503192430.7582-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Souptick Joarder [Fri, 18 May 2018 23:08:47 +0000 (16:08 -0700)]
include/linux/mm.h: add new inline function vmf_error()
Many places in drivers/ file systems, error was handled in a common way
like below:
ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
vmf_error() will replace this and return vm_fault_t type err.
A lot of drivers and filesystems currently have a rather complex mapping
of errno-to-VM_FAULT code. We have been able to eliminate a lot of it
by just returning VM_FAULT codes directly from functions which are
called exclusively from the fault handling path.
Some functions can be called both from the fault handler and other
context which are expecting an errno, so they have to continue to return
an errno. Some users still need to choose different behaviour for
different errnos, but vmf_error() captures the essential error
translation that's common to all users, and those that need to handle
additional errors can handle them first.
Link: http://lkml.kernel.org/r/20180510174826.GA14268@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Fri, 18 May 2018 23:08:44 +0000 (16:08 -0700)]
lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
I had neglected to increment the error counter when the tests failed,
which made the tests noisy when they fail, but not actually return an
error code.
Link: http://lkml.kernel.org/r/20180509114328.9887-1-mpe@ellerman.id.au
Fixes:
3cc78125a081 ("lib/test_bitmap.c: add optimisation tests")
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Yury Norov <ynorov@caviumnetworks.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Darren Hart [Sat, 12 May 2018 19:10:07 +0000 (12:10 -0700)]
platform/x86: DELL_WMI use depends on instead of select for DELL_SMBIOS
If DELL_WMI "select"s DELL_SMBIOS, the DELL_SMBIOS dependencies are
ignored and it is still possible to end up with unmet direct
dependencies.
Change the select to a depends on.
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
Linus Torvalds [Fri, 18 May 2018 17:24:03 +0000 (10:24 -0700)]
Merge tag 'powerpc-4.17-6' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Just three commits.
The two cxl ones are not fixes per se, but they modify code that was
added this cycle so that it will work with a recent firmware change.
And then a fix for a recent commit that added sleeps in the NVRAM
code, which needs to be more careful and not sleep if eg. we're called
in the panic() path.
Thanks to Nicholas Piggin, Philippe Bergheaud, Christophe Lombard"
* tag 'powerpc-4.17-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
cxl: Report the tunneled operations status
cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
Linus Torvalds [Fri, 18 May 2018 17:21:03 +0000 (10:21 -0700)]
Merge tag 'acpi-4.17-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix an ACPICA regression introduced in this cycle and related to the
handling of package objects loaded by the Load and loadTable AML
operators that are not initialized properly after recent changes (Bob
Moore)"
* tag 'acpi-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Add deferred package support for the Load and loadTable operators
Linus Torvalds [Fri, 18 May 2018 17:14:42 +0000 (10:14 -0700)]
Merge tag 'pm-4.17-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix Kconfig dependencies of the armada-37xx cpufreq driver (Miquel
Raynal)"
* tag 'pm-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: armada-37xx: driver relies on cpufreq-dt
Linus Torvalds [Fri, 18 May 2018 17:12:30 +0000 (10:12 -0700)]
Merge tag 'usb-4.17-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB driver fixes fro 4.17-rc6.
They resolve some reported bugs in the musb driver, the xhci driver,
and a number of small fixes for the usbip driver.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usbip: usbip_host: fix bad unlock balance during stub_probe()
usbip: usbip_host: fix NULL-ptr deref and use-after-free errors
usbip: usbip_host: run rebind from exit when module is removed
usbip: usbip_host: delete device from busid_table after rebind
usbip: usbip_host: refine probe and disconnect debug msgs to be useful
usb: musb: fix remote wakeup racing with suspend
xhci: Fix USB3 NULL pointer dereference at logical disconnect.
Linus Torvalds [Fri, 18 May 2018 17:10:43 +0000 (10:10 -0700)]
Merge tag 'for-linus-
20180518' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Single fix this time, from Coly, fixing a failure case when
CONFIG_DEBUGFS isn't enabled"
* tag 'for-linus-
20180518' of git://git.kernel.dk/linux-block:
bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
Linus Torvalds [Fri, 18 May 2018 17:09:20 +0000 (10:09 -0700)]
Merge tag 'spi-fix-v4.17-rc5' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes accumilated since the merge window, all
fairly small and driver specific"
* tag 'spi-fix-v4.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm2835aux: ensure interrupts are enabled for shared handler
spi: bcm-qspi: Always read and set BSPI_MAST_N_BOOT_CTRL
spi: bcm-qspi: Avoid setting MSPI_CDRAM_PCS for spi-nor master
spi: pxa2xx: Allow 64-bit DMA
spi: cadence: Add usleep_range() for cdns_spi_fill_tx_fifo()
spi: sh-msiof: Fix bit field overflow writes to TSCR/RSCR
spi: imx: Update MODULE_DESCRIPTION to "SPI Controller driver"
Linus Torvalds [Fri, 18 May 2018 16:58:29 +0000 (09:58 -0700)]
Merge tag 'mtd/fixes-for-4.17-rc6' of git://git.infradead.org/linux-mtd
Pull mtd fixes from Boris Brezillon:
"NAND fixes:
- Fix read path of the Marvell NAND driver
- Make sure we don't pass a u64 to ndelay()
CFI fix:
- Fix the map_word_andequal() implementation"
* tag 'mtd/fixes-for-4.17-rc6' of git://git.infradead.org/linux-mtd:
mtd: rawnand: Fix return type of __DIVIDE() when called with 32-bit
mtd: rawnand: marvell: Fix read logic for layouts with ->nchunks > 2
mtd: Fix comparison in map_word_andequal()
Linus Torvalds [Fri, 18 May 2018 16:24:52 +0000 (09:24 -0700)]
Merge tag 'drm-fixes-for-v4.17-rc6' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Pretty quiet week again: one vmwgfx regression fix, one core buffer
overflow fix, one vc4 leak fix and three i915 fixes"
* tag 'drm-fixes-for-v4.17-rc6' of git://people.freedesktop.org/~airlied/linux:
drm/dumb-buffers: Integer overflow in drm_mode_create_ioctl()
drm/i915/gen9: Add WaClearHIZ_WM_CHICKEN3 for bxt and glk
drm/vmwgfx: Set dmabuf_size when vmw_dmabuf_init is successful
drm/vc4: Fix leak of the file_priv that stored the perfmon.
drm/i915/execlists: Use rmb() to order CSB reads
drm/i915/userptr: reject zero user_size
drm: Match sysfs name in link removal to link creation
Dave Airlie [Fri, 18 May 2018 02:01:49 +0000 (12:01 +1000)]
Merge tag 'drm-intel-fixes-2018-05-17' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Userptr IOCTL zero size check (Matt)
- Two hardware quirk fixes (Michel & Chris)
* tag 'drm-intel-fixes-2018-05-17' of git://anongit.freedesktop.org/drm/drm-intel:
drm/i915/gen9: Add WaClearHIZ_WM_CHICKEN3 for bxt and glk
drm/i915/execlists: Use rmb() to order CSB reads
drm/i915/userptr: reject zero user_size
Linus Torvalds [Thu, 17 May 2018 22:58:12 +0000 (15:58 -0700)]
Merge tag 'hwmon-for-linus-v4.17-rc6' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Two k10temp fixes:
- fix race condition when accessing System Management Network
registers
- fix reading critical temperatures on F15h M60h and M70h
Also add PCI ID's for the AMD Raven Ridge root bridge"
* tag 'hwmon-for-linus-v4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (k10temp) Use API function to access System Management Network
x86/amd_nb: Add support for Raven Ridge CPUs
hwmon: (k10temp) Fix reading critical temperature register