Paolo Bonzini [Sat, 27 Sep 2014 09:03:33 +0000 (11:03 +0200)]
Merge tag 'kvm-arm-for-3.18' of git://git./linux/kernel/git/kvmarm/kvmarm into kvm-next
Changes for KVM for arm/arm64 for 3.18
This includes a bunch of changes:
- Support read-only memory slots on arm/arm64
- Various changes to fix Sparse warnings
- Correctly detect write vs. read Stage-2 faults
- Various VGIC cleanups and fixes
- Dynamic VGIC data strcuture sizing
- Fix SGI set_clear_pend offset bug
- Fix VTTBR_BADDR Mask
- Correctly report the FSC on Stage-2 faults
Conflicts:
virt/kvm/eventfd.c
[duplicate, different patch where the kvm-arm version broke x86.
The kvm tree instead has the right one]
Christoffer Dall [Fri, 26 Sep 2014 10:29:34 +0000 (12:29 +0200)]
arm/arm64: KVM: Report correct FSC for unsupported fault types
When we catch something that's not a permission fault or a translation
fault, we log the unsupported FSC in the kernel log, but we were masking
off the bottom bits of the FSC which was not very helpful.
Also correctly report the FSC for data and instruction faults rather
than telling people it was a DFCS, which doesn't exist in the ARM ARM.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Joel Schopp [Wed, 9 Jul 2014 16:17:04 +0000 (11:17 -0500)]
arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
The current aarch64 calculation for VTTBR_BADDR_MASK masks only 39 bits
and not all the bits in the PA range. This is clearly a bug that
manifests itself on systems that allocate memory in the higher address
space range.
[ Modified from Joel's original patch to be based on PHYS_MASK_SHIFT
instead of a hard-coded value and to move the alignment check of the
allocation to mmu.c. Also added a comment explaining why we hardcode
the IPA range and changed the stage-2 pgd allocation to be based on
the 40 bit IPA range instead of the maximum possible 48 bit PA range.
- Christoffer ]
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Andres Lagar-Cavilla [Thu, 25 Sep 2014 22:26:50 +0000 (15:26 -0700)]
kvm: Fix kvm_get_page_retry_io __gup retval check
Confusion around -EBUSY and zero (inside a BUG_ON no less).
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christoffer Dall [Thu, 25 Sep 2014 16:41:07 +0000 (18:41 +0200)]
arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
The sgi values calculated in read_set_clear_sgi_pend_reg() and
write_set_clear_sgi_pend_reg() were horribly incorrectly multiplied by 4
with catastrophic results in that subfunctions ended up overwriting
memory not allocated for the expected purpose.
This showed up as bugs in kfree() and the kernel complaining a lot of
you turn on memory debugging.
This addresses: http://marc.info/?l=kvm&m=
141164910007868&w=2
Reported-by: Shannon Zhao <zhaoshenglong@huawei.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Paolo Bonzini [Wed, 24 Sep 2014 21:19:45 +0000 (23:19 +0200)]
Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-next
Patch queue for ppc - 2014-09-24
New awesome things in this release:
- E500: e6500 core support
- E500: guest and remote debug support
- Book3S: remote sw breakpoint support
- Book3S: HV: Minor bugfixes
Alexander Graf (1):
KVM: PPC: Pass enum to kvmppc_get_last_inst
Bharat Bhushan (8):
KVM: PPC: BOOKE: allow debug interrupt at "debug level"
KVM: PPC: BOOKE : Emulate rfdi instruction
KVM: PPC: BOOKE: Allow guest to change MSR_DE
KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG
KVM: PPC: BOOKE: Guest and hardware visible debug registers are same
KVM: PPC: BOOKE: Add one reg interface for DBSR
KVM: PPC: BOOKE: Add one_reg documentation of SPRG9 and DBSR
KVM: PPC: BOOKE: Emulate debug registers and exception
Madhavan Srinivasan (2):
powerpc/kvm: support to handle sw breakpoint
powerpc/kvm: common sw breakpoint instr across ppc
Michael Neuling (1):
KVM: PPC: Book3S HV: Add register name when loading toc
Mihai Caraman (10):
powerpc/booke: Restrict SPE exception handlers to e200/e500 cores
powerpc/booke: Revert SPE/AltiVec common defines for interrupt numbers
KVM: PPC: Book3E: Increase FPU laziness
KVM: PPC: Book3e: Add AltiVec support
KVM: PPC: Make ONE_REG powerpc generic
KVM: PPC: Move ONE_REG AltiVec support to powerpc
KVM: PPC: Remove the tasklet used by the hrtimer
KVM: PPC: Remove shared defines for SPE and AltiVec interrupts
KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core
KVM: PPC: Book3E: Enable e6500 core
Paul Mackerras (2):
KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads
KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
Tang Chen [Wed, 24 Sep 2014 07:57:58 +0000 (15:57 +0800)]
kvm: x86: Unpin and remove kvm_arch->apic_access_page
In order to make the APIC access page migratable, stop pinning it in
memory.
And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page. When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Wed, 24 Sep 2014 07:57:54 +0000 (15:57 +0800)]
kvm: vmx: Implement set_apic_access_page_addr
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this for the case where there is no nested
virtualization, or where the nested guest does not have an APIC page of
its own. All accesses to kvm->arch.apic_access_page are changed to go
through kvm_vcpu_reload_apic_access_page.
If the APIC access page is invalidated when the host is running, we update
the VMCS in the next guest entry.
If it is invalidated when the guest is running, the MMU notifier will force
an exit, after which we will handle everything as in the previous case.
If it is invalidated when a nested guest is running, the request will update
either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the
next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Wed, 24 Sep 2014 07:57:54 +0000 (15:57 +0800)]
kvm: x86: Add request bit to reload APIC access page address
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Wed, 24 Sep 2014 07:57:57 +0000 (15:57 +0800)]
kvm: Add arch specific mmu notifier for page invalidation
This will be used to let the guest run while the APIC access page is
not pinned. Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Wed, 24 Sep 2014 07:57:55 +0000 (15:57 +0800)]
kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
Different architectures need different requests, and in fact we
will use this function in architecture-specific code later. This
will be outside kvm_main.c, so make it non-static and rename it to
kvm_make_all_cpus_request().
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andres Lagar-Cavilla [Mon, 22 Sep 2014 21:54:42 +0000 (14:54 -0700)]
kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andres Lagar-Cavilla [Tue, 23 Sep 2014 19:34:54 +0000 (12:34 -0700)]
kvm/x86/mmu: Pass gfn and level to rmapp callback.
Callbacks don't have to do extra computation to learn what the caller
(lvm_handle_hva_range()) knows very well. Useful for
debugging/tracing/printk/future.
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 22 Sep 2014 11:17:48 +0000 (13:17 +0200)]
x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.
In that case, KVM will fail to patch VMCALL instructions to VMMCALL
as required on AMD processors.
The failure mode is currently a divide-by-zero exception, which obviously
is a KVM bug that has to be fixed. However, picking the right instruction
between VMCALL and VMMCALL will be faster and will help if you cannot upgrade
the hypervisor.
Reported-by: Chris Webb <chris@arachsys.com>
Tested-by: Chris Webb <chris@arachsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Chen Yucong [Tue, 23 Sep 2014 02:44:35 +0000 (10:44 +0800)]
kvm: x86: use macros to compute bank MSRs
Avoid open coded calculations for bank MSRs by using well-defined
macros that hide the index of higher bank MSRs.
No semantic changes.
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 23 Sep 2014 07:01:57 +0000 (10:01 +0300)]
KVM: x86: Remove debug assertion of non-PAE reserved bits
Commit
346874c9507a ("KVM: x86: Fix CR3 reserved bits") removed non-PAE
reserved bits which were not according to Intel SDM. However, residue was left
in a debug assertion (CR3_NONPAE_RESERVED_BITS). Remove it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 19 Sep 2014 23:03:25 +0000 (16:03 -0700)]
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
vcpu ioctls can hang the calling thread if issued while a vcpu is running.
However, invalid ioctls can happen when userspace tries to probe the kind
of file descriptors (e.g. isatty() calls ioctl(TCGETS)); in that case,
we know the ioctl is going to be rejected as invalid anyway and we can
fail before trying to take the vcpu mutex.
This patch does not change functionality, it just makes invalid ioctls
fail faster.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andres Lagar-Cavilla [Wed, 17 Sep 2014 17:51:48 +0000 (10:51 -0700)]
kvm: Faults which trigger IO release the mmap_sem
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.
If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.
This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tiejun Chen [Mon, 22 Sep 2014 02:31:38 +0000 (10:31 +0800)]
kvm: x86: fix two typos in comment
s/drity/dirty and s/vmsc01/vmcs01
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 18 Sep 2014 19:39:44 +0000 (22:39 +0300)]
KVM: vmx: Inject #GP on invalid PAT CR
Guest which sets the PAT CR to invalid value should get a #GP. Currently, if
vmx supports loading PAT CR during entry, then the value is not checked. This
patch makes the required check in that case.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 18 Sep 2014 19:39:43 +0000 (22:39 +0300)]
KVM: x86: emulating descriptor load misses long-mode case
In 64-bit mode a #GP should be delivered to the guest "if the code segment
descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit
set and the D-bit clear." - Intel SDM "Interrupt 13—General Protection
Exception (#GP)".
This patch fixes the behavior of CS loading emulation code. Although the
comment says that segment loading is not supported in long mode, this function
is executed in long mode, so the fix is necassary.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liang Chen [Thu, 18 Sep 2014 16:38:37 +0000 (12:38 -0400)]
KVM: x86: directly use kvm_make_request again
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Thu, 18 Sep 2014 16:38:36 +0000 (12:38 -0400)]
KVM: x86: count actual tlb flushes
- we count KVM_REQ_TLB_FLUSH requests, not actual flushes
(KVM can have multiple requests for one flush)
- flushes from kvm_flush_remote_tlbs aren't counted
- it's easy to make a direct request by mistake
Solve these by postponing the counting to kvm_check_request().
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marcelo Tosatti [Thu, 18 Sep 2014 21:24:57 +0000 (18:24 -0300)]
KVM: nested VMX: disable perf cpuid reporting
Initilization of L2 guest with -cpu host, on L1 guest with -cpu host
triggers:
(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported
Nested VMX MSR load/store support is not sufficient to
allow perf for L2 guest.
Until properly fixed, trap CPUID and disable function 0xA.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 16 Sep 2014 23:50:50 +0000 (02:50 +0300)]
KVM: x86: Don't report guest userspace emulation error to userspace
Commit
fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to race-condition between a vmexit and the instruction emulator.
The same rational applies also to userspace applications that are permitted by
the guest OS to access MMIO area or perform PIO.
This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also for guest userspace code.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 16 Sep 2014 11:37:40 +0000 (13:37 +0200)]
kvm: Make init_rmode_tss() return 0 on success.
In init_rmode_tss(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_set_tss_addr(), and ret is redundant.
This patch removes the redundant variable, by making init_rmode_tss()
return 0 on success, -errno on failure.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 16 Sep 2014 12:10:03 +0000 (15:10 +0300)]
KVM: x86: Warn if guest virtual address space is not 48-bits
The KVM emulator code assumes that the guest virtual address space (in 64-bit)
is 48-bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if
userspace tries to create a guest that does not obey this restriction.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 24 Sep 2014 11:02:46 +0000 (13:02 +0200)]
kvm-vfio: do not use module_init
/me got confused between the kernel and QEMU. In the kernel, you can
only have one module_init function, and it will prevent unloading the
module unless you also have the corresponding module_exit function.
So, commit
80ce1639727e (KVM: VFIO: register kvm_device_ops dynamically,
2014-09-02) broke unloading of the kvm module, by adding a module_init
function and no module_exit.
Repair it by making kvm_vfio_ops_init weak, and checking it in
kvm_init.
Cc: Will Deacon <will.deacon@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Alex Williamson <Alex.Williamson@redhat.com>
Fixes:
80ce1639727e9d38729c34f162378508c307ca25
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christoffer Dall [Mon, 22 Sep 2014 21:33:08 +0000 (23:33 +0200)]
KVM: EVENTFD: Remove inclusion of irq.h
Commit c77dcac (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
functionality that depends on definitions in ioapic.h when
__KVM_HAVE_IOAPIC is defined.
At the same time, kvm-arm commit 0ba0951 (KVM: EVENTFD: remove inclusion
of irq.h) removed the inclusion of irq.h, an architecture-specific header
that is not present on ARM but which happened to include ioapic.h on x86.
Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
This fixes x86 and lets ARM use eventfd.c.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Wed, 10 Sep 2014 12:37:29 +0000 (14:37 +0200)]
KVM: PPC: Pass enum to kvmppc_get_last_inst
The kvmppc_get_last_inst function recently received a facelift that allowed
us to pass an enum of the type of instruction we want to read into it rather
than an unreadable boolean.
Unfortunately, not all callers ended up passing the enum. This wasn't really
an issue as "true" and "false" happen to match the two enum values we have,
but it's still hard to read.
Update all callers of kvmppc_get_last_inst() to follow the new calling
convention.
Signed-off-by: Alexander Graf <agraf@suse.de>
Madhavan Srinivasan [Tue, 9 Sep 2014 17:07:36 +0000 (22:37 +0530)]
powerpc/kvm: common sw breakpoint instr across ppc
This patch extends the use of illegal instruction as software
breakpoint instruction across the ppc platform. Patch extends
booke program interrupt code to support software breakpoint.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[agraf: Fix bookehv]
Signed-off-by: Alexander Graf <agraf@suse.de>
Madhavan Srinivasan [Tue, 9 Sep 2014 17:07:35 +0000 (22:37 +0530)]
powerpc/kvm: support to handle sw breakpoint
This patch adds kernel side support for software breakpoint.
Design is that, by using an illegal instruction, we trap to hypervisor
via Emulation Assistance interrupt, where we check for the illegal instruction
and accordingly we return to Host or Guest. Patch also adds support for
software breakpoint in PR KVM.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Mon, 1 Sep 2014 09:01:59 +0000 (12:01 +0300)]
KVM: PPC: Book3E: Enable e6500 core
Now that AltiVec and hardware thread support is in place enable e6500 core.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Mon, 1 Sep 2014 09:01:58 +0000 (12:01 +0300)]
KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core
ePAPR represents hardware threads as cpu node properties in device tree.
So with existing QEMU, hardware threads are simply exposed as vcpus with
one hardware thread.
The e6500 core shares TLBs between hardware threads. Without tlb write
conditional instruction, the Linux kernel uses per core mechanisms to
protect against duplicate TLB entries.
The guest is unable to detect real siblings threads, so it can't use the
TLB protection mechanism. An alternative solution is to use the hypervisor
to allocate different lpids to guest's vcpus that runs simultaneous on real
siblings threads. On systems with two threads per core this patch halves
the size of the lpid pool that the allocator sees and use two lpids per VM.
Use even numbers to speedup vcpu lpid computation with consecutive lpids
per VM: vm1 will use lpids 2 and 3, vm2 lpids 4 and 5, and so on.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: fix spelling]
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Tue, 2 Sep 2014 06:14:43 +0000 (16:14 +1000)]
KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
Since the guest can read the machine's PVR (Processor Version Register)
directly and see the real value, we should disallow userspace from
setting any value for the guest's PVR other than the real host value.
Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied
PVR value and return an error if it is different from the host value,
which has been put into vcpu->arch.pvr at vcpu creation time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Tue, 2 Sep 2014 06:14:42 +0000 (16:14 +1000)]
KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads
Occasional failures have been seen with split-core mode and migration
where the message "KVM: couldn't grab cpu" appears. This increases
the length of time that we wait from 1ms to 10ms, which seems to
work around the issue.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Mon, 1 Sep 2014 10:17:43 +0000 (13:17 +0300)]
KVM: PPC: Remove shared defines for SPE and AltiVec interrupts
We currently decide at compile-time which of the SPE or AltiVec units to
support exclusively. Guard kernel defines with CONFIG_SPE_POSSIBLE and
CONFIG_PPC_E500MC and remove shared defines.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Mon, 1 Sep 2014 14:19:56 +0000 (17:19 +0300)]
KVM: PPC: Remove the tasklet used by the hrtimer
Powerpc timer implementation is a copycat version of s390. Now that they removed
the tasklet with commit
ea74c0ea1b24a6978a6ebc80ba4dbc7b7848b32d follow this
optimization.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 13 Aug 2014 09:09:44 +0000 (14:39 +0530)]
KVM: PPC: BOOKE: Emulate debug registers and exception
This patch emulates debug registers and debug exception
to support guest using debug resource. This enables running
gdb/kgdb etc in guest.
On BOOKE architecture we cannot share debug resources between QEMU and
guest because:
When QEMU is using debug resources then debug exception must
be always enabled. To achieve this we set MSR_DE and also set
MSRP_DEP so guest cannot change MSR_DE.
When emulating debug resource for guest we want guest
to control MSR_DE (enable/disable debug interrupt on need).
So above mentioned two configuration cannot be supported
at the same time. So the result is that we cannot share
debug resources between QEMU and Guest on BOOKE architecture.
In the current design QEMU gets priority over guest, this means that if
QEMU is using debug resources then guest cannot use them and if guest is
using debug resource then QEMU can overwrite them.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:36:25 +0000 (16:36 +0300)]
KVM: PPC: Move ONE_REG AltiVec support to powerpc
Move ONE_REG AltiVec support to powerpc generic layer.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:36:24 +0000 (16:36 +0300)]
KVM: PPC: Make ONE_REG powerpc generic
Make ONE_REG generic for server and embedded architectures by moving
kvm_vcpu_ioctl_get_one_reg() and kvm_vcpu_ioctl_set_one_reg() functions
to powerpc layer.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:36:23 +0000 (16:36 +0300)]
KVM: PPC: Book3e: Add AltiVec support
Add AltiVec support in KVM for Book3e. FPU support gracefully reuse host
infrastructure so follow the same approach for AltiVec.
Book3e specification defines shared interrupt numbers for SPE and AltiVec
units. Still SPE is present in e200/e500v2 cores while AltiVec is present in
e6500 core. So we can currently decide at compile-time which of the SPE or
AltiVec units to support exclusively by using CONFIG_SPE_POSSIBLE and
CONFIG_PPC_E500MC defines. As Alexander Graf suggested, keep SPE and AltiVec
exception handlers distinct to improve code readability.
Guests have the privilege to enable AltiVec, so we always need to support
AltiVec in KVM and implicitly in host to reflect interrupts and to save/restore
the unit context. KVM will be loaded on cores with AltiVec unit only if
CONFIG_ALTIVEC is defined. Use this define to guard KVM AltiVec logic.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:36:22 +0000 (16:36 +0300)]
KVM: PPC: Book3E: Increase FPU laziness
Increase FPU laziness by loading the guest state into the unit before entering
the guest instead of doing it on each vcpu schedule. Without this improvement
an interrupt may claim floating point corrupting guest state.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 13 Aug 2014 09:10:06 +0000 (14:40 +0530)]
KVM: PPC: BOOKE: Add one_reg documentation of SPRG9 and DBSR
This was missed in respective one_reg implementation patch.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Michael Neuling [Tue, 19 Aug 2014 04:59:30 +0000 (14:59 +1000)]
KVM: PPC: Book3S HV: Add register name when loading toc
Add 'r' to register name r2 in kvmppc_hv_enter.
Also update comment at the top of kvmppc_hv_enter to indicate that R2/TOC is
non-volatile.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:09:04 +0000 (16:09 +0300)]
powerpc/booke: Revert SPE/AltiVec common defines for interrupt numbers
Book3E specification defines shared interrupt numbers for SPE and AltiVec
units. Still SPE is present in e200/e500v2 cores while AltiVec is present in
e6500 core. So we can currently decide at compile-time which unit to support
exclusively. As Alexander Graf suggested, this will improve code readability
especially in KVM.
Use distinct defines to identify SPE/AltiVec interrupt numbers, reverting
c58ce397 and
6b310fc5 patches that added common defines.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 20 Aug 2014 13:09:03 +0000 (16:09 +0300)]
powerpc/booke: Restrict SPE exception handlers to e200/e500 cores
SPE exception handlers are now defined for 32-bit e500mc cores even though
SPE unit is not present and CONFIG_SPE is undefined.
Restrict SPE exception handlers to e200/e500 cores adding CONFIG_SPE_POSSIBLE
and consequently guard __stup_ivors and __setup_cpu functions.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:56 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: Add one reg interface for DBSR
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:55 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: Guest and hardware visible debug registers are same
Guest visible debug register and hardware visible debug registers are
same, so ther is no need to have arch->shadow_dbg_reg, instead use
arch->dbg_reg.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:54 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG
Dbsr is not visible to userspace and we do not think any need to
expose this to userspace because:
Userspace cannot inject debug interrupt to guest (as this
does not know guest ability to handle debug interrupt), so
userspace will always clear DBSR.
Now if userspace has to always clear DBSR in KVM_EXIT_DEBUG
handling then clearing dbsr in kernel looks simple as this
avoid doing SET_SREGS/set_one_reg() to clear DBSR
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:53 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: Allow guest to change MSR_DE
This patch changes the default behavior of MSRP_DEP, that is
guest is not allowed to change the MSR_DE, to guest can change
MSR_DE. When userspace is debugging guest then it override the
default behavior and set MSRP_DEP. This stops guest to change
MSR_DE when userspace is debugging guest.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:52 +0000 (12:08 +0530)]
KVM: PPC: BOOKE : Emulate rfdi instruction
This patch adds "rfdi" instruction emulation which is required for
guest debug hander on BOOKE-HV
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Bharat Bhushan [Wed, 6 Aug 2014 06:38:51 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: allow debug interrupt at "debug level"
Debug interrupt can be either "critical level" or "debug level".
There are separate set of save/restore registers used for different level.
Example: DSRR0/DSRR1 are used for "debug level" and CSRR0/CSRR1
are used for critical level debug interrupt.
Using CPU_FTR_DEBUG_LVL_EXC to decide which interrupt level to be used.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Marc Zyngier [Tue, 8 Jul 2014 11:09:07 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: make number of irqs a configurable attribute
In order to make the number of interrupts configurable, use the new
fancy device management API to add KVM_DEV_ARM_VGIC_GRP_NR_IRQS as
a VGIC configurable attribute.
Userspace can now specify the exact size of the GIC (by increments
of 32 interrupts).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:06 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: delay vgic allocation until init time
It is now quite easy to delay the allocation of the vgic tables
until we actually require it to be up and running (when the first
vcpu is kicking around, or someones tries to access the GIC registers).
This allow us to allocate memory for the exact number of CPUs we
have. As nobody configures the number of interrupts just yet,
use a fallback to VGIC_NR_IRQS_LEGACY.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:05 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: kill VGIC_NR_IRQS
Nuke VGIC_NR_IRQS entierly, now that the distributor instance
contains the number of IRQ allocated to this GIC.
Also add VGIC_NR_IRQS_LEGACY to preserve the current API.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:04 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: handle out-of-range MMIO accesses
Now that we can (almost) dynamically size the number of interrupts,
we're facing an interesting issue:
We have to evaluate at runtime whether or not an access hits a valid
register, based on the sizing of this particular instance of the
distributor. Furthermore, the GIC spec says that accessing a reserved
register is RAZ/WI.
For this, add a new field to our range structure, indicating the number
of bits a single interrupts uses. That allows us to find out whether or
not the access is in range.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:03 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: kill VGIC_MAX_CPUS
We now have the information about the number of CPU interfaces in
the distributor itself. Let's get rid of VGIC_MAX_CPUS, and just
rely on KVM_MAX_VCPUS where we don't have the choice. Yet.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:02 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: Parametrize VGIC_NR_SHARED_IRQS
Having a dynamic number of supported interrupts means that we
cannot relly on VGIC_NR_SHARED_IRQS being fixed anymore.
Instead, make it take the distributor structure as a parameter,
so it can return the right value.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:01 +0000 (12:09 +0100)]
arm/arm64: KVM: vgic: switch to dynamic allocation
So far, all the VGIC data structures are statically defined by the
*maximum* number of vcpus and interrupts it supports. It means that
we always have to oversize it to cater for the worse case.
Start by changing the data structures to be dynamically sizeable,
and allocate them at runtime.
The sizes are still very static though.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 8 Jul 2014 11:09:00 +0000 (12:09 +0100)]
KVM: ARM: vgic: plug irq injection race
As it stands, nothing prevents userspace from injecting an interrupt
before the guest's GIC is actually initialized.
This goes unnoticed so far (as everything is pretty much statically
allocated), but ends up exploding in a spectacular way once we switch
to a more dynamic allocation (the GIC data structure isn't there yet).
The fix is to test for the "ready" flag in the VGIC distributor before
trying to inject the interrupt. Note that in order to avoid breaking
userspace, we have to ignore what is essentially an error.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Sat, 14 Jun 2014 20:34:04 +0000 (22:34 +0200)]
arm/arm64: KVM: vgic: Clarify and correct vgic documentation
The VGIC virtual distributor implementation documentation was written a
very long time ago, before the true nature of the beast had been
partially absorbed into my bloodstream. Clarify the docs.
Plus, it fixes an actual bug. ICFRn, pfff.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Sat, 14 Jun 2014 20:30:45 +0000 (22:30 +0200)]
arm/arm64: KVM: vgic: Fix SGI writes to GICD_I{CS}PENDR0
Writes to GICD_ISPENDR0 and GICD_ICPENDR0 ignore all settings of the
pending state for SGIs. Make sure the implementation handles this
correctly.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Sat, 14 Jun 2014 19:54:51 +0000 (21:54 +0200)]
arm/arm64: KVM: vgic: Improve handling of GICD_I{CS}PENDRn
Writes to GICD_ISPENDRn and GICD_ICPENDRn are currently not handled
correctly for level-triggered interrupts. The spec states that for
level-triggered interrupts, writes to the GICD_ISPENDRn activate the
output of a flip-flop which is in turn or'ed with the actual input
interrupt signal. Correspondingly, writes to GICD_ICPENDRn simply
deactivates the output of that flip-flop, but does not (of course) affect
the external input signal. Reads from GICC_IAR will also deactivate the
flip-flop output.
This requires us to track the state of the level-input separately from
the state in the flip-flop. We therefore introduce two new variables on
the distributor struct to track these two states. Astute readers may
notice that this is introducing more state than required (because an OR
of the two states gives you the pending state), but the remaining vgic
code uses the pending bitmap for optimized operations to figure out, at
the end of the day, if an interrupt is pending or not on the distributor
side. Refactoring the code to consider the two state variables all the
places where we currently access the precomputed pending value, did not
look pretty.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Sat, 14 Jun 2014 20:37:33 +0000 (22:37 +0200)]
arm/arm64: KVM: vgic: Clear queued flags on unqueue
If we unqueue a level-triggered interrupt completely, and the LR does
not stick around in the active state (and will therefore no longer
generate a maintenance interrupt), then we should clear the queued flag
so that the vgic can actually queue this level-triggered interrupt at a
later time and deal with its pending state then.
Note: This should actually be properly fixed to handle the active state
on the distributor.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Mon, 9 Jun 2014 10:55:13 +0000 (12:55 +0200)]
arm/arm64: KVM: Rename irq_active to irq_queued
We have a special bitmap on the distributor struct to keep track of when
level-triggered interrupts are queued on the list registers. This was
named irq_active, which is confusing, because the active state of an
interrupt as per the GIC spec is a different thing, not specifically
related to edge-triggered/level-triggered configurations but rather
indicates an interrupt which has been ack'ed but not yet eoi'ed.
Rename the bitmap and the corresponding accessor functions to irq_queued
to clarify what this is actually used for.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Mon, 9 Jun 2014 10:27:18 +0000 (12:27 +0200)]
arm/arm64: KVM: Rename irq_state to irq_pending
The irq_state field on the distributor struct is ambiguous in its
meaning; the comment says it's the level of the input put, but that
doesn't make much sense for edge-triggered interrupts. The code
actually uses this state variable to check if the interrupt is in the
pending state on the distributor so clarify the comment and rename the
actual variable and accessor methods.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Christoffer Dall [Fri, 19 Sep 2014 01:15:32 +0000 (18:15 -0700)]
Merge remote-tracking branch 'kvm/next' into queue
Conflicts:
arch/arm64/include/asm/kvm_host.h
virt/kvm/arm/vgic.c
Tang Chen [Tue, 16 Sep 2014 10:41:59 +0000 (18:41 +0800)]
kvm: Make init_rmode_identity_map() return 0 on success.
In init_rmode_identity_map(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.
This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Tue, 16 Sep 2014 10:41:58 +0000 (18:41 +0800)]
kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.
In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized
Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.
NOTE: In the original code, ept identity pagetable page is pinned in memroy.
As a result, it cannot be migrated/hot-removed. After this patch, since
kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
is no longer pinned in memory. And it can be migrated/hot-removed.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Will Deacon [Tue, 2 Sep 2014 09:27:36 +0000 (10:27 +0100)]
KVM: VFIO: register kvm_device_ops dynamically
Now that we have a dynamic means to register kvm_device_ops, use that
for the VFIO kvm device, instead of relying on the static table.
This is achieved by a module_init call to register the ops with KVM.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <Alex.Williamson@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cornelia Huck [Tue, 2 Sep 2014 09:27:35 +0000 (10:27 +0100)]
KVM: s390: register flic ops dynamically
Using the new kvm_register_device_ops() interface makes us get rid of
an #ifdef in common code.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Will Deacon [Tue, 2 Sep 2014 09:27:34 +0000 (10:27 +0100)]
KVM: ARM: vgic: register kvm_device_ops dynamically
Now that we have a dynamic means to register kvm_device_ops, use that
for the ARM VGIC, instead of relying on the static table.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Will Deacon [Tue, 2 Sep 2014 09:27:33 +0000 (10:27 +0100)]
KVM: device: add simple registration mechanism for kvm_device_ops
kvm_ioctl_create_device currently has knowledge of all the device types
and their associated ops. This is fairly inflexible when adding support
for new in-kernel device emulations, so move what we currently have out
into a table, which can support dynamic registration of ops by new
drivers for virtual hardware.
Cc: Alex Williamson <Alex.Williamson@redhat.com>
Cc: Alex Graf <agraf@suse.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zhang Haoyu [Thu, 11 Sep 2014 08:47:04 +0000 (16:47 +0800)]
kvm: ioapic: conditionally delay irq delivery duringeoi broadcast
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI writing and irq delivery. If we do not emulate this behavior, and
re-inject the interrupt immediately after the guest sends an EOI and re-enables
interrupts, a guest might spend all its time in the ISR if it has a broken
handler for a level-triggered interrupt.
Such livelock actually happens with Windows guests when resuming from
hibernation.
As there's no way to recognize the broken handle from new raised ones, this patch
delays an interrupt if 10.000 consecutive EOIs found that the interrupt was
still high. The guest can then make a little forward progress, until a proper
IRQ handler is set or until some detection routine in the guest (such as
Linux's note_interrupt()) recognizes the situation.
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guo Hui Liu [Fri, 12 Sep 2014 05:43:19 +0000 (13:43 +0800)]
KVM: x86: Use kvm_make_request when applicable
This patch replace the set_bit method by kvm_make_request
to make code more readable and consistent.
Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eric Auger [Mon, 1 Sep 2014 08:36:08 +0000 (09:36 +0100)]
KVM: EVENTFD: remove inclusion of irq.h
No more needed. irq.h would be void on ARM.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Ard Biesheuvel [Tue, 9 Sep 2014 10:27:09 +0000 (11:27 +0100)]
ARM/arm64: KVM: fix use of WnR bit in kvm_is_write_fault()
The ISS encoding for an exception from a Data Abort has a WnR
bit[6] that indicates whether the Data Abort was caused by a
read or a write instruction. While there are several fields
in the encoding that are only valid if the ISV bit[24] is set,
WnR is not one of them, so we can read it unconditionally.
Instead of fixing both implementations of kvm_is_write_fault()
in place, reimplement it just once using kvm_vcpu_dabt_iswrite(),
which already does the right thing with respect to the WnR bit.
Also fix up the callers to pass 'vcpu'
Acked-by: Laszlo Ersek <lersek@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Paolo Bonzini [Thu, 11 Sep 2014 09:51:02 +0000 (11:51 +0200)]
KVM: x86: make apic_accept_irq tracepoint more generic
Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case. However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tang Chen [Thu, 11 Sep 2014 05:38:00 +0000 (13:38 +0800)]
kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of
apic access page. So use this macro.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Sep 2014 09:09:33 +0000 (11:09 +0200)]
Merge tag 'kvm-s390-next-
20140910' of git://git./linux/kernel/git/kvms390/linux into kvm-next
KVM: s390: Fixes and features for next (3.18)
1. Crypto/CPACF support: To enable the MSA4 instructions we have to
provide a common control structure for each SIE control block
2. Two cleanups found by a static code checker: one redundant assignment
and one useless if
3. Fix the page handling of the diag10 ballooning interface. If the
guest freed the pages at absolute 0 some checks and frees were
incorrect
4. Limit guests to 16TB
5. Add __must_check to interrupt injection code
Christian Borntraeger [Wed, 3 Sep 2014 14:16:47 +0000 (16:16 +0200)]
KVM: s390/interrupt: remove double assignment
r is already initialized to 0.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Christian Borntraeger [Wed, 3 Sep 2014 19:23:13 +0000 (21:23 +0200)]
KVM: s390/cmm: Fix prefix handling for diag 10 balloon
The old handling of prefix pages was broken in the diag10 ballooner.
We now rely on gmap_discard to check for start > end and do a
slow path if the prefix swap pages are affected:
1. discard the pages from start to prefix
2. discard the absolute 0 pages
3. discard the pages after prefix swap to end
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Christian Borntraeger [Wed, 3 Sep 2014 19:17:03 +0000 (21:17 +0200)]
KVM: s390: get rid of constant condition in ipte_unlock_simple
Due to the earlier check we know that ipte_lock_count must be 0.
No need to add a useless if. Let's make clear that we are going
to always wakeup when we execute that code.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Christian Borntraeger [Wed, 3 Sep 2014 14:21:32 +0000 (16:21 +0200)]
KVM: s390: unintended fallthrough for external call
We must not fallthrough if the conditions for external call are not met.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Christian Borntraeger [Mon, 25 Aug 2014 10:38:57 +0000 (12:38 +0200)]
KVM: s390: Limit guest size to 16TB
Currently we fill up a full 5 level page table to hold the guest
mapping. Since commit "support gmap page tables with less than 5
levels" we can do better.
Having more than 4 TB might be useful for some testing scenarios,
so let's just limit ourselves to 16TB guest size.
Having more than that is totally untested as I do not have enough
swap space/memory.
We continue to allow ucontrol the full size.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Christian Borntraeger [Mon, 25 Aug 2014 10:27:29 +0000 (12:27 +0200)]
KVM: s390: add __must_check to interrupt deliver functions
We now propagate interrupt injection errors back to the ioctl. We
should mark functions that might fail with __must_check.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Tony Krowiak [Fri, 27 Jun 2014 18:46:01 +0000 (14:46 -0400)]
KVM: CPACF: Enable MSA4 instructions for kvm guest
We have to provide a per guest crypto block for the CPUs to
enable MSA4 instructions. According to icainfo on z196 or
later this enables CCM-AES-128, CMAC-AES-128, CMAC-AES-192
and CMAC-AES-256.
Signed-off-by: Tony Krowiak <akrowiak@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split MSA4/protected key into two patches]
Alex Bennée [Tue, 9 Sep 2014 16:27:19 +0000 (17:27 +0100)]
KVM: fix api documentation of KVM_GET_EMULATED_CPUID
It looks like when this was initially merged it got accidentally included
in the following section. I've just moved it back in the correct section
and re-numbered it as other ioctls have been added since.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alex Bennée [Tue, 9 Sep 2014 16:27:18 +0000 (17:27 +0100)]
KVM: document KVM_SET_GUEST_DEBUG api
In preparation for working on the ARM implementation I noticed the debug
interface was missing from the API document. I've pieced together the
expected behaviour from the code and commit messages written it up as
best I can.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christian Borntraeger [Thu, 4 Sep 2014 19:13:33 +0000 (21:13 +0200)]
KVM: remove redundant assignments in __kvm_set_memory_region
__kvm_set_memory_region sets r to EINVAL very early.
Doing it again is not necessary. The same is true later on, where
r is assigned -ENOMEM twice.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christian Borntraeger [Thu, 4 Sep 2014 19:13:32 +0000 (21:13 +0200)]
KVM: remove redundant assigment of return value in kvm_dev_ioctl
The first statement of kvm_dev_ioctl is
long r = -EINVAL;
No need to reassign the same value.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christian Borntraeger [Thu, 4 Sep 2014 19:13:31 +0000 (21:13 +0200)]
KVM: remove redundant check of in_spin_loop
The expression `vcpu->spin_loop.in_spin_loop' is always true,
because it is evaluated only when the condition
`!vcpu->spin_loop.in_spin_loop' is false.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 2 Sep 2014 11:23:06 +0000 (13:23 +0200)]
KVM: x86: propagate exception from permission checks on the nested page fault
Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault. To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpu instead operates on the "outer" (NPT)
instance.
Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Sep 2014 17:46:15 +0000 (19:46 +0200)]
KVM: x86: skip writeback on injection of nested exception
If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault. However because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP. We do not write back
if an instruction caused an interception vmexit---do the same for page
faults.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 2 Sep 2014 11:18:37 +0000 (13:18 +0200)]
KVM: nSVM: propagate the NPF EXITINFO to the guest
This is similar to what the EPT code does with the exit qualification.
This allows the guest to see a valid value for bits 33:32.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 2 Sep 2014 11:24:12 +0000 (13:24 +0200)]
KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it
in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tiejun Chen [Mon, 1 Sep 2014 10:44:04 +0000 (18:44 +0800)]
KVM: mmio: cleanup kvm_set_mmio_spte_mask
Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask()
for slightly better code.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Mon, 18 Aug 2014 22:46:07 +0000 (15:46 -0700)]
kvm: x86: fix stale mmio cache bug
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:
(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.
(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.
(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.
This patch fixes the issue by using the memslot generation number
to validate the mmio cache.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Mon, 18 Aug 2014 22:46:06 +0000 (15:46 -0700)]
kvm: fix potentially corrupt mmio cache
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not aquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.
If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.
We can prevent the following case:
vcpu (CPU 0) | thread (CPU 1)
--------------------------------------------+--------------------------
1 vm exit |
2 srcu_read_unlock(&kvm->srcu) |
3 decide to cache something based on |
old memslots |
4 | change memslots
| (increments generation)
5 | synchronize_srcu(&kvm->srcu);
6 retrieve generation # from new memslots |
7 tag cache with new memslot generation |
8 srcu_read_unlock(&kvm->srcu) |
... |
<action based on cache occurs even |
though the caching decision was based |
on the old memslots> |
... |
<action *continues* to occur until next |
memslot generation change, which may |
be never> |
|
By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).
Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots. It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.
To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE. This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited. Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>