kernel/kernel-generic.git
16 years agoKVM: s390: update maintainers
Christian Borntraeger [Tue, 25 Mar 2008 17:47:41 +0000 (18:47 +0100)]
KVM: s390: update maintainers

This patch adds an entry for kvm on s390 to the MAINTAINERS file :-). We intend
to push all patches regarding this via Avi's kvm.git.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: API documentation
Carsten Otte [Tue, 25 Mar 2008 17:47:38 +0000 (18:47 +0100)]
KVM: s390: API documentation

This patch adds Documentation/s390/kvm.txt, which describes specifics of kvm's
user interface that are unique to s390 architecture.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: add kvm to kconfig on s390
Christian Borntraeger [Tue, 25 Mar 2008 17:47:36 +0000 (18:47 +0100)]
KVM: s390: add kvm to kconfig on s390

This patch adds the virtualization submenu and the kvm option to the kernel
config. It also defines HAVE_KVM for 64bit kernels.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: intercepts for diagnose instructions
Christian Borntraeger [Tue, 25 Mar 2008 17:47:34 +0000 (18:47 +0100)]
KVM: s390: intercepts for diagnose instructions

This patch introduces interpretation of some diagnose instruction intercepts.
Diagnose is our classic architected way of doing a hypercall. This patch
features the following diagnose codes:
- vm storage size, that tells the guest about its memory layout
- time slice end, which is used by the guest to indicate that it waits
  for a lock and thus cannot use up its time slice in a useful way
- ipl functions, which a guest can use to reset and reboot itself

In order to implement ipl functions, we also introduce an exit reason that
causes userspace to perform various resets on the virtual machine. All resets
are described in the principles of operation book, except KVM_S390_RESET_IPL
which causes a reboot of the machine.

Acked-by: Martin Schwidefsky <martin.schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: interprocessor communication via sigp
Christian Borntraeger [Tue, 25 Mar 2008 17:47:31 +0000 (18:47 +0100)]
KVM: s390: interprocessor communication via sigp

This patch introduces in-kernel handling of _some_ sigp interprocessor
signals (similar to ipi).
kvm_s390_handle_sigp() decodes the sigp instruction and calls individual
handlers depending on the operation requested:
- sigp sense tries to retrieve information such as existence or running state
  of the remote cpu
- sigp emergency sends an external interrupt to the remote cpu
- sigp stop stops a remote cpu
- sigp stop store status stops a remote cpu, and stores its entire internal
  state to the cpus lowcore
- sigp set arch sets the architecture mode of the remote cpu. setting to
  ESAME (s390x 64bit) is accepted, setting to ESA/S390 (s390, 31 or 24 bit) is
  denied, all others are passed to userland
- sigp set prefix sets the prefix register of a remote cpu

For implementation of this, the stop intercept indication starts to get reused
on purpose: a set of action bits defines what to do once a cpu gets stopped:
ACTION_STOP_ON_STOP  really stops the cpu when a stop intercept is recognized
ACTION_STORE_ON_STOP stores the cpu status to lowcore when a stop intercept is
                     recognized
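A minimal sketch of how such action bits could be consumed when a stop intercept is recognized. The bit values, struct vcpu_model and handle_stop_intercept() are hypothetical names for illustration only; the commit message gives the flag names but not their encoding or the real handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encodings: the commit names the flags, not their values. */
#define ACTION_STORE_ON_STOP (1u << 0)
#define ACTION_STOP_ON_STOP  (1u << 1)

/* Toy vcpu state, not the kernel's struct kvm_vcpu. */
struct vcpu_model {
    unsigned int action_bits; /* what to do at the next stop intercept */
    bool stopped;
    bool status_stored;
};

/* Consume the pending action bits when a stop intercept is recognized. */
static void handle_stop_intercept(struct vcpu_model *vcpu)
{
    assert(vcpu != NULL);

    if (vcpu->action_bits & ACTION_STORE_ON_STOP) {
        vcpu->status_stored = true;   /* stands in for storing status to lowcore */
        vcpu->action_bits &= ~ACTION_STORE_ON_STOP;
    }
    if (vcpu->action_bits & ACTION_STOP_ON_STOP) {
        vcpu->stopped = true;         /* the cpu really stops */
        vcpu->action_bits &= ~ACTION_STOP_ON_STOP;
    }
}
```

In this model a "stop and store status" request would set both bits, while a plain stop would set only ACTION_STOP_ON_STOP.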

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: intercepts for privileged instructions
Christian Borntraeger [Tue, 25 Mar 2008 17:47:29 +0000 (18:47 +0100)]
KVM: s390: intercepts for privileged instructions

This patch introduces in-kernel handling of some intercepts for privileged
instructions:

handle_set_prefix()        sets the prefix register of the local cpu
handle_store_prefix()      stores the content of the prefix register to memory
handle_store_cpu_address() stores the cpu number of the current cpu to memory
handle_skey()              just decrements the instruction address and retries
handle_stsch()             delivers condition code 3 "operation not supported"
handle_chsc()              same here
handle_stfl()              stores the facility list which contains the
                           capabilities of the cpu
handle_stidp()             stores cpu type/model/revision and such
handle_stsi()              stores information about the system topology

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: interrupt subsystem, cpu timer, waitpsw
Carsten Otte [Tue, 25 Mar 2008 17:47:26 +0000 (18:47 +0100)]
KVM: s390: interrupt subsystem, cpu timer, waitpsw

This patch contains the s390 interrupt subsystem (similar to in kernel apic)
including timer interrupts (similar to in-kernel-pit) and enabled wait
(similar to in kernel hlt).

In order to achieve that, this patch also introduces intercept handling
for instruction intercepts, and it implements load control instructions.

This patch introduces an ioctl KVM_S390_INTERRUPT which is valid for both
the vm file descriptors and the vcpu file descriptors. In case this ioctl is
issued against a vm file descriptor, the interrupt is considered floating.
Floating interrupts may be delivered to any virtual cpu in the configuration.

The following interrupts are supported:
SIGP STOP       - interprocessor signal that stops a remote cpu
SIGP SET PREFIX - interprocessor signal that sets the prefix register of a
                  (stopped) remote cpu
INT EMERGENCY   - interprocessor interrupt, usually used to signal need_resched
                  and for smp_call_function() in the guest.
PROGRAM INT     - exception during program execution such as page fault, illegal
                  instruction and friends
RESTART         - interprocessor signal that starts a stopped cpu
INT VIRTIO      - floating interrupt for virtio signalisation
INT SERVICE     - floating interrupt for signalisations from the system
                  service processor

struct kvm_s390_interrupt, which is submitted as ioctl parameter when injecting
an interrupt, also carries parameter data for interrupts along with the interrupt
type. Interrupts on s390 usually have a state that represents the current
operation, or identifies which device has caused the interruption on s390.

kvm_s390_handle_wait() does handle waitpsw in two flavors: in case of a
disabled wait (that is, disabled for interrupts), we exit to userspace. In case
of an enabled wait we set up a timer that equals the cpu clock comparator value
and sleep on a wait queue.
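The two wait flavors can be sketched as a small decision function. This is only a userspace model under stated assumptions: plan_wait(), struct wait_plan and the enum names are hypothetical, and the real kvm_s390_handle_wait() works on vcpu state rather than plain arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum wait_outcome {
    WAIT_EXIT_TO_USERSPACE, /* disabled wait: nothing can wake the cpu */
    WAIT_SLEEP_ON_QUEUE     /* enabled wait: sleep until timer or interrupt */
};

struct wait_plan {
    enum wait_outcome outcome;
    uint64_t timer_expiry;  /* only meaningful for WAIT_SLEEP_ON_QUEUE */
};

/* Decide how to handle a wait PSW, given whether interrupts are enabled. */
static struct wait_plan plan_wait(bool interrupts_enabled,
                                  uint64_t clock_comparator)
{
    struct wait_plan plan = { WAIT_EXIT_TO_USERSPACE, 0 };

    if (interrupts_enabled) {
        plan.outcome = WAIT_SLEEP_ON_QUEUE;
        plan.timer_expiry = clock_comparator; /* timer equals the comparator */
    }
    return plan;
}
```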

[christian: change virtio interrupt to 0x2603]

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: sie intercept handling
Christian Borntraeger [Tue, 25 Mar 2008 17:47:23 +0000 (18:47 +0100)]
KVM: s390: sie intercept handling

This patch introduces handling of sie intercepts in three flavors: Intercepts
are either handled completely in-kernel by kvm_handle_sie_intercept(),
or passed to userspace with corresponding data in struct kvm_run in case
kvm_handle_sie_intercept() returns -ENOTSUPP.
In case of partial execution in kernel with the need of userspace support,
kvm_handle_sie_intercept() may choose to set up struct kvm_run and return
-EREMOTE.
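The three flavors can be illustrated by how a caller might interpret the handler's return code. This is a sketch, not the patch's code: classify_intercept_rc(), the enum and the SKETCH_* constants are hypothetical, and the numeric errno values are stand-ins chosen for this example.

```c
#include <assert.h>

/* Stand-ins for the errno values named above; the numbers are assumptions. */
#define SKETCH_ENOTSUPP 524
#define SKETCH_EREMOTE  66

enum next_step {
    STEP_HANDLED_IN_KERNEL, /* 0: intercept handled completely in-kernel */
    STEP_EXIT_FILL_RUN,     /* -ENOTSUPP: pass intercept data to userspace */
    STEP_EXIT_RUN_READY,    /* -EREMOTE: handler already set up kvm_run */
    STEP_FAIL               /* any other value: report the error */
};

/* Map a handler return code onto what the run loop should do next. */
static enum next_step classify_intercept_rc(int rc)
{
    if (rc == 0)
        return STEP_HANDLED_IN_KERNEL;
    if (rc == -SKETCH_ENOTSUPP)
        return STEP_EXIT_FILL_RUN;
    if (rc == -SKETCH_EREMOTE)
        return STEP_EXIT_RUN_READY;
    return STEP_FAIL;
}
```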

The trivial intercept reasons are handled in this patch:
handle_noop() just does nothing for intercepts that don't require our support
  at all
handle_stop() is called when a cpu enters stopped state, and it drops out to
  userland after updating our vcpu state
handle_validity() faults in the cpu lowcore if needed, or passes the request
  to userland

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: s390: arch backend for the kvm kernel module
Heiko Carstens [Tue, 25 Mar 2008 17:47:20 +0000 (18:47 +0100)]
KVM: s390: arch backend for the kvm kernel module

This patch contains the port of Qumranet's kvm kernel module to IBM zSeries
(aka s390x, mainframe) architecture. It uses the mainframe's virtualization
instruction SIE to run virtual machines with up to 64 virtual CPUs each.
This port is only usable on 64bit host kernels, and can only run 64bit guest
kernels. However, running 31bit applications in guest userspace is possible.

The following source files are introduced by this patch
arch/s390/kvm/kvm-s390.c    similar to arch/x86/kvm/x86.c, this implements all
                            arch callbacks for kvm. __vcpu_run calls back into
                            sie64a to enter the guest machine context
arch/s390/kvm/sie64a.S      assembler function sie64a, which enters guest
                            context via SIE, and switches world before and after
                            that
include/asm-s390/kvm_host.h contains all vital data structures needed to run
                            virtual machines on the mainframe
include/asm-s390/kvm.h      defines kvm_regs and friends for user access to
                            guest register content
arch/s390/kvm/gaccess.h     functions similar to uaccess to access guest memory
arch/s390/kvm/kvm-s390.h    header file for kvm-s390 internals, extended by
                            later patches

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agos390: KVM preparation: address of the 64bit extint parm in lowcore
Christian Borntraeger [Tue, 25 Mar 2008 17:47:15 +0000 (18:47 +0100)]
s390: KVM preparation: address of the 64bit extint parm in lowcore

The address 0x11b8 is used by z/VM for pfault and diag 250 I/O to
provide a 64 bit extint parameter. virtio uses the same address, so
it's time to update the lowcore structure.

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agos390: KVM preparation: host memory management changes for s390 kvm
Christian Borntraeger [Tue, 25 Mar 2008 17:47:12 +0000 (18:47 +0100)]
s390: KVM preparation: host memory management changes for s390 kvm

This patch changes the s390 memory management definitions to use the pgste field
for dirty and reference bit tracking of host and guest code. Usually on s390,
dirty and referenced are tracked in storage keys, which belong to the physical
page. This changes with virtualization: The guest and host dirty/reference bits
are defined to be the logical OR of the values for the mapping and the physical
page. This patch implements the necessary changes in pgtable.h for s390.
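The "logical OR" rule can be written down as a tiny model. The struct and function names below are hypothetical and only illustrate the rule stated above; the real bits live in the storage key and the pgste, not in C booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for the physical page (storage key) and one mapping (pgste). */
struct storage_key_model { bool dirty; bool referenced; };
struct pgste_model       { bool dirty; bool referenced; };

/* Effective dirty state: OR of the mapping's bit and the page's bit. */
static bool effective_dirty(const struct pgste_model *m,
                            const struct storage_key_model *k)
{
    assert(m != NULL && k != NULL);
    return m->dirty || k->dirty;
}

/* Effective referenced state, by the same rule. */
static bool effective_referenced(const struct pgste_model *m,
                                 const struct storage_key_model *k)
{
    assert(m != NULL && k != NULL);
    return m->referenced || k->referenced;
}
```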

There is a common code change in mm/rmap.c, the call to
page_test_and_clear_young must be moved. This is a no-op for all
architectures but s390. page_referenced checks the referenced bits for
the physical page and for all mappings:
o The physical page is checked with page_test_and_clear_young.
o The mappings are checked with ptep_test_and_clear_young and friends.

Without pgstes (the current implementation on Linux s390) the physical page
check is implemented but the mapping callbacks are no-ops because dirty
and referenced are not tracked in the s390 page tables. The pgstes introduce
guest and host dirty and reference bits for s390 in the host mapping. These
mappings must be checked before page_test_and_clear_young resets the reference
bit.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agos390: KVM preparation: provide hook to enable pgstes in user pagetable
Carsten Otte [Tue, 25 Mar 2008 17:47:10 +0000 (18:47 +0100)]
s390: KVM preparation: provide hook to enable pgstes in user pagetable

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process now has a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86: hardware task switching support
Izik Eidus [Mon, 24 Mar 2008 21:14:53 +0000 (23:14 +0200)]
KVM: x86: hardware task switching support

This emulates the x86 hardware task switch mechanism in software, as it is
unsupported by either vmx or svm.  It allows operating systems which use it,
like freedos, to run as kvm guests.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86: add functions to get the cpl of vcpu
Izik Eidus [Mon, 24 Mar 2008 17:38:34 +0000 (19:38 +0200)]
KVM: x86: add functions to get the cpl of vcpu

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Add module option to disable flexpriority
Avi Kivity [Mon, 24 Mar 2008 16:15:14 +0000 (18:15 +0200)]
KVM: VMX: Add module option to disable flexpriority

Useful for debugging.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: no longer EXPERIMENTAL
Avi Kivity [Sun, 23 Mar 2008 16:36:30 +0000 (18:36 +0200)]
KVM: no longer EXPERIMENTAL

Long overdue.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Introduce and use spte_to_page()
Avi Kivity [Sun, 23 Mar 2008 13:06:23 +0000 (15:06 +0200)]
KVM: MMU: Introduce and use spte_to_page()

Encapsulate the pte mask'n'shift in a function.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: fix dirty bit setting when removing write permissions
Izik Eidus [Thu, 20 Mar 2008 16:17:24 +0000 (18:17 +0200)]
KVM: MMU: fix dirty bit setting when removing write permissions

When mmu_set_spte() checks whether a page related to an spte should be released
as dirty or clean, it checks whether the shadow pte was writable. However, if
rmap_write_protect() has been called, shadow ptes that were writable may have
become read-only, and mmu_set_spte() will therefore release the pages as clean.

This patch fixes the issue by marking the page as dirty inside
rmap_write_protect().
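The bug and the fix can be modeled in a few lines. This is a sketch with hypothetical types (page_model, spte_model) and helper names; it is not the kvm MMU code, only the ordering argument from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model { bool dirty; };

struct spte_model {
    bool writable;
    struct page_model *page;
};

/* Release path: decides dirty vs. clean from the current writable bit. */
static void release_page_model(const struct spte_model *spte)
{
    assert(spte != NULL && spte->page != NULL);
    if (spte->writable)
        spte->page->dirty = true;
}

/* Write-protect path with the fix: record dirtiness before dropping the bit. */
static void write_protect_model(struct spte_model *spte)
{
    assert(spte != NULL && spte->page != NULL);
    if (spte->writable) {
        spte->page->dirty = true; /* without this line the page ends up clean */
        spte->writable = false;
    }
}
```

Without the marked line, a page written through the spte and then write-protected would be released as clean in this model, which is the behavior the patch corrects.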

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Move some x86 specific constants and structures to include/asm-x86
Avi Kivity [Fri, 21 Mar 2008 10:38:23 +0000 (12:38 +0200)]
KVM: Move some x86 specific constants and structures to include/asm-x86

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Set the accessed bit on non-speculative shadow ptes
Avi Kivity [Tue, 18 Mar 2008 09:05:52 +0000 (11:05 +0200)]
KVM: MMU: Set the accessed bit on non-speculative shadow ptes

If we populate a shadow pte due to a fault (and not speculatively due to a
pte write) then we can set the accessed bit on it, as we know it will be
set immediately on the next guest instruction.  This saves a read-modify-write
operation.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: kvm.h: __user requires compiler.h
Christian Borntraeger [Wed, 12 Mar 2008 17:10:45 +0000 (18:10 +0100)]
KVM: kvm.h: __user requires compiler.h

include/linux/kvm.h defines struct kvm_dirty_log to
[...]
union {
	void __user *dirty_bitmap; /* one bit per page */
	__u64 padding;
};

__user requires compiler.h to compile. Currently, this works on x86
only coincidentally due to other include files. This patch makes
kvm.h compile in all cases.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: disable clock before rebooting.
Glauber Costa [Mon, 17 Mar 2008 19:08:40 +0000 (16:08 -0300)]
x86: KVM guest: disable clock before rebooting.

This patch writes 0 (actually, what really matters is that the
LSB is cleared) to the system time msr before shutting down
the machine for kexec.
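A sketch of why clearing the LSB matters, assuming (as the text above implies) that bit 0 of the value written to the system time msr acts as the enable bit. host_model, guest_wrmsr_model() and host_tick_model() are hypothetical names; the host's periodic write into guest memory is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CLOCK_ENABLE 1ULL /* assumption: LSB is the enable bit */

/* Toy host state: last msr value and number of writes into guest memory. */
struct host_model {
    uint64_t system_time_msr;
    unsigned long guest_memory_writes;
};

/* Guest writes the system time msr (address of the time area plus flags). */
static void guest_wrmsr_model(struct host_model *h, uint64_t value)
{
    assert(h != NULL);
    h->system_time_msr = value;
}

/* Host periodic update: only touches guest memory while enabled. */
static void host_tick_model(struct host_model *h)
{
    assert(h != NULL);
    if (h->system_time_msr & SKETCH_CLOCK_ENABLE)
        h->guest_memory_writes++;
}
```

Writing 0 before kexec stops the updates in this model, so the new kernel's memory is not overwritten at the old address.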

Without it, we can have a random memory location being written
when the guest comes back

It overrides the functions shutdown, used in the path of kernel_kexec() (sys.c)
and crash_shutdown, used in the path of crash_kexec() (kexec.c)

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: make native_machine_shutdown non-static
Glauber Costa [Mon, 17 Mar 2008 19:08:39 +0000 (16:08 -0300)]
x86: make native_machine_shutdown non-static

This will allow external users to call it. It is mainly
useful for routines that will override its machine_ops
field for its own special purposes, but want to call the
normal shutdown routine after they're done
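The override pattern described above can be sketched with a function-pointer table. Everything here is a userspace model: machine_ops_model, the *_model functions and the event log are hypothetical names, not the x86 machine_ops definitions.

```c
#include <assert.h>
#include <stddef.h>

enum event { EV_CUSTOM_WORK = 1, EV_NATIVE_SHUTDOWN = 2 };

/* Small event log so the call order can be inspected. */
static enum event event_log[4];
static size_t event_count;

static void log_event(enum event e)
{
    if (event_count < sizeof(event_log) / sizeof(event_log[0]))
        event_log[event_count++] = e;
}

struct machine_ops_model {
    void (*shutdown)(void);
};

/* Non-static in the real patch so that overriding code can chain to it. */
static void native_shutdown_model(void)
{
    log_event(EV_NATIVE_SHUTDOWN);
}

static struct machine_ops_model ops_model = { native_shutdown_model };

/* Replacement: do the special-purpose work, then call the normal routine. */
static void custom_shutdown_model(void)
{
    log_event(EV_CUSTOM_WORK);
    native_shutdown_model();
}

static void install_custom_shutdown(void)
{
    ops_model.shutdown = custom_shutdown_model;
}
```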

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: allow machine_crash_shutdown to be replaced
Glauber Costa [Mon, 17 Mar 2008 19:08:38 +0000 (16:08 -0300)]
x86: allow machine_crash_shutdown to be replaced

This patch allows machine_crash_shutdown to
be replaced, just like any of the other functions
in machine_ops

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: hypercall batching
Marcelo Tosatti [Fri, 22 Feb 2008 17:21:38 +0000 (12:21 -0500)]
x86: KVM guest: hypercall batching

Batch pte updates and tlb flushes in lazy MMU mode.
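A sketch of the batching idea: while in lazy MMU mode, operations are queued and delivered with one hypercall when the queue fills or lazy mode ends. The capacity, struct mmu_batch and the function names are assumptions for illustration; the hypercall itself is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_CAPACITY 8 /* hypothetical queue size */

struct mmu_batch {
    bool lazy;                   /* currently in lazy MMU mode? */
    size_t queued;               /* ops waiting for the next hypercall */
    unsigned long hypercalls;    /* hypercalls issued so far */
    unsigned long ops_delivered; /* ops handed to the host so far */
};

/* Deliver all queued ops with a single (modeled) hypercall. */
static void batch_flush(struct mmu_batch *b)
{
    assert(b != NULL);
    if (b->queued == 0)
        return;
    b->hypercalls++;
    b->ops_delivered += b->queued;
    b->queued = 0;
}

/* Queue one pte update or tlb flush; flush at once outside lazy mode. */
static void batch_add_op(struct mmu_batch *b)
{
    assert(b != NULL);
    b->queued++;
    if (!b->lazy || b->queued == BATCH_CAPACITY)
        batch_flush(b);
}

static void batch_enter_lazy(struct mmu_batch *b)
{
    assert(b != NULL);
    b->lazy = true;
}

static void batch_leave_lazy(struct mmu_batch *b)
{
    assert(b != NULL);
    b->lazy = false;
    batch_flush(b);
}
```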

[avi:
 - adjust to mmu_op
 - helper for getting para_state without debug warnings]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: hypercall based pte updates and TLB flushes
Marcelo Tosatti [Fri, 22 Feb 2008 17:21:37 +0000 (12:21 -0500)]
x86: KVM guest: hypercall based pte updates and TLB flushes

Hypercall based pte updates are faster than faults, and also allow use
of the lazy MMU mode to batch operations.

Don't report the feature if two dimensional paging is enabled.

[avi:
 - guest/host split
 - fix 32-bit truncation issues
 - adjust to mmu_op
 - adjust to ->release_*() renamed
 - add ->release_pud()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: hypercall based pte updates and TLB flushes
Marcelo Tosatti [Fri, 22 Feb 2008 17:21:37 +0000 (12:21 -0500)]
KVM: MMU: hypercall based pte updates and TLB flushes

Hypercall based pte updates are faster than faults, and also allow use
of the lazy MMU mode to batch operations.

Don't report the feature if two dimensional paging is enabled.

[avi:
 - one mmu_op hypercall instead of one per op
 - allow 64-bit gpa on hypercall
 - don't pass host errors (-ENOMEM) to guest]

[akpm: warning fix on i386]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Provide unlocked version of emulator_write_phys()
Avi Kivity [Sun, 2 Mar 2008 12:06:05 +0000 (14:06 +0200)]
KVM: Provide unlocked version of emulator_write_phys()

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: add basic paravirt support
Marcelo Tosatti [Fri, 22 Feb 2008 17:21:36 +0000 (12:21 -0500)]
x86: KVM guest: add basic paravirt support

Add basic KVM paravirt support. Avoid vm-exits on IO delays.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: add basic paravirt support
Marcelo Tosatti [Fri, 22 Feb 2008 17:21:36 +0000 (12:21 -0500)]
KVM: add basic paravirt support

Add basic KVM paravirt support. Avoid vm-exits on IO delays.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add reset support for in kernel PIT
Sheng Yang [Thu, 13 Mar 2008 02:22:26 +0000 (10:22 +0800)]
KVM: Add reset support for in kernel PIT

Separate the reset part and prepare for reset support.

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add save/restore supporting of in kernel PIT
Sheng Yang [Mon, 3 Mar 2008 16:50:59 +0000 (00:50 +0800)]
KVM: Add save/restore supporting of in kernel PIT

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: In kernel PIT model
Sheng Yang [Sun, 27 Jan 2008 21:10:22 +0000 (05:10 +0800)]
KVM: In kernel PIT model

The patch moves the PIT model from userspace to kernel, and increases
the timer accuracy greatly.

[marcelo: make last_injected_time per-guest]

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-and-Acked-by: Alex Davis <alex14641@yahoo.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Remove pointless desc_ptr #ifdef
Avi Kivity [Wed, 5 Mar 2008 07:33:44 +0000 (09:33 +0200)]
KVM: Remove pointless desc_ptr #ifdef

The desc_struct changes left an unnecessary #ifdef; remove it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Don't adjust tsc offset forward
Avi Kivity [Tue, 4 Mar 2008 08:44:51 +0000 (10:44 +0200)]
KVM: VMX: Don't adjust tsc offset forward

Most Intel hosts have a stable tsc, and playing with the offset only
reduces accuracy.  By limiting tsc offset adjustment only to forward updates,
we effectively disable tsc offset adjustment on these hosts.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: replace remaining __FUNCTION__ occurances
Harvey Harrison [Mon, 3 Mar 2008 20:59:56 +0000 (12:59 -0800)]
KVM: replace remaining __FUNCTION__ occurances

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: detect if VCPU triple faults
Joerg Roedel [Tue, 26 Feb 2008 15:49:16 +0000 (16:49 +0100)]
KVM: detect if VCPU triple faults

In the current inject_page_fault path KVM only checks if there is another PF
pending and injects a DF then. But it has to check for a pending DF too to
detect a shutdown condition in the VCPU.  If this is not detected the VCPU goes
to a PF -> DF -> PF loop when it should triple fault. This patch detects this
condition and handles it with a KVM_SHUTDOWN exit to userspace.
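The escalation rule described above can be sketched as a pure function of the currently pending exception. The enum and function names are hypothetical; only the PF -> DF -> shutdown ordering is taken from the description.

```c
#include <assert.h>

enum pending_exception { PENDING_NONE, PENDING_PAGE_FAULT, PENDING_DOUBLE_FAULT };

enum fault_outcome {
    OUTCOME_INJECT_PAGE_FAULT,
    OUTCOME_INJECT_DOUBLE_FAULT,
    OUTCOME_SHUTDOWN /* triple fault: exit to userspace */
};

/* Decide what to do when a new page fault must be injected. */
static enum fault_outcome on_new_page_fault(enum pending_exception pending)
{
    switch (pending) {
    case PENDING_PAGE_FAULT:
        return OUTCOME_INJECT_DOUBLE_FAULT; /* PF while PF pending */
    case PENDING_DOUBLE_FAULT:
        return OUTCOME_SHUTDOWN;            /* PF while DF pending */
    case PENDING_NONE:
    default:
        return OUTCOME_INJECT_PAGE_FAULT;
    }
}
```

Before the fix, the PENDING_DOUBLE_FAULT case was effectively missing, which is what produced the PF -> DF -> PF loop.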

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Use kzalloc to avoid allocating kvm_regs from kernel stack
Xiantao Zhang [Mon, 25 Feb 2008 10:52:20 +0000 (18:52 +0800)]
KVM: Use kzalloc to avoid allocating kvm_regs from kernel stack

Since the size of kvm_regs is too big to allocate from kernel stack on ia64,
use kzalloc to allocate it.

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Prefix control register accessors with kvm_ to avoid namespace pollution
Avi Kivity [Sun, 24 Feb 2008 09:20:43 +0000 (11:20 +0200)]
KVM: Prefix control register accessors with kvm_ to avoid namespace pollution

Names like 'set_cr3()' look dangerously close to affecting the host.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: large page support
Marcelo Tosatti [Sat, 23 Feb 2008 14:44:30 +0000 (11:44 -0300)]
KVM: MMU: large page support

Create large pages mappings if the guest PTE's are marked as such and
the underlying memory is hugetlbfs backed.  If the largepage contains
write-protected pages, a large pte is not used.

Gives a consistent 2% improvement for data copies on ram mounted
filesystem, without NPT/EPT.

Anthony measures a 4% improvement on 4-way kernbench, with NPT.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: ignore zapped root pagetables
Marcelo Tosatti [Wed, 20 Feb 2008 19:47:24 +0000 (14:47 -0500)]
KVM: MMU: ignore zapped root pagetables

Mark zapped root pagetables as invalid and ignore such pages during lookup.

This is a problem with the cr3-target feature, where a zapped root table fools
the faulting code into creating a read-only mapping. The result is a lockup
if the instruction can't be emulated.
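A sketch of "mark as invalid and ignore during lookup". shadow_page_model, zap_root_model() and lookup_page_model() are hypothetical and use a flat array instead of the real shadow page hash; only the skip-invalid rule comes from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct shadow_page_model {
    unsigned long gfn;
    bool invalid; /* set when a root page table is zapped */
};

/* Zapping a root no longer removes it at once; it is only marked invalid. */
static void zap_root_model(struct shadow_page_model *sp)
{
    assert(sp != NULL);
    sp->invalid = true;
}

/* Lookup skips invalid pages, so a zapped root is never found again. */
static struct shadow_page_model *
lookup_page_model(struct shadow_page_model *pages, size_t n, unsigned long gfn)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].gfn == gfn && !pages[i].invalid)
            return &pages[i];
    }
    return NULL;
}
```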

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Implement dummy values for MSR_PERF_STATUS
Alexander Graf [Thu, 21 Feb 2008 11:11:01 +0000 (12:11 +0100)]
KVM: Implement dummy values for MSR_PERF_STATUS

Darwin relies on this and ceases to work without.

Signed-off-by: Alexander Graf <alex@csgraf.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: sparse fixes for kvm/x86.c
Harvey Harrison [Tue, 19 Feb 2008 18:25:50 +0000 (10:25 -0800)]
KVM: sparse fixes for kvm/x86.c

In two case statements, use the ever popular 'i' instead of index:
arch/x86/kvm/x86.c:1063:7: warning: symbol 'index' shadows an earlier one
arch/x86/kvm/x86.c:1000:9: originally declared here
arch/x86/kvm/x86.c:1079:7: warning: symbol 'index' shadows an earlier one
arch/x86/kvm/x86.c:1000:9: originally declared here

Make it static.
arch/x86/kvm/x86.c:1945:24: warning: symbol 'emulate_ops' was not declared. Should it be static?

Drop the return statements.
arch/x86/kvm/x86.c:2878:2: warning: returning void-valued expression
arch/x86/kvm/x86.c:2944:2: warning: returning void-valued expression

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: make iopm_base static
Harvey Harrison [Tue, 19 Feb 2008 18:32:02 +0000 (10:32 -0800)]
KVM: SVM: make iopm_base static

Fixes sparse warning as well.
arch/x86/kvm/svm.c:69:15: warning: symbol 'iopm_base' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: fix sparse warnings in x86_emulate.c
Harvey Harrison [Tue, 19 Feb 2008 18:43:11 +0000 (10:43 -0800)]
KVM: x86 emulator: fix sparse warnings in x86_emulate.c

Nesting __emulate_2op_nobyte inside __emulate_2op produces many shadowed
variable warnings on the internal variable _tmp used by both macros.

Change the outer macro to use __tmp.

Avoids a sparse warning like the following at every call site of __emulate_2op
arch/x86/kvm/x86_emulate.c:1091:3: warning: symbol '_tmp' shadows an earlier one
arch/x86/kvm/x86_emulate.c:1091:3: originally declared here
[18 more warnings suppressed]

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add stat counter for hypercalls
Amit Shah [Wed, 20 Feb 2008 19:30:30 +0000 (01:00 +0530)]
KVM: Add stat counter for hypercalls

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Use x86's segment descriptor struct instead of private definition
Avi Kivity [Wed, 20 Feb 2008 15:57:21 +0000 (17:57 +0200)]
KVM: Use x86's segment descriptor struct instead of private definition

The x86 desc_struct unification allows us to remove segment_descriptor.h.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Increase the number of user memory slots per vm
Avi Kivity [Wed, 20 Feb 2008 10:04:47 +0000 (12:04 +0200)]
KVM: Increase the number of user memory slots per vm

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add API for determining the number of supported memory slots
Avi Kivity [Wed, 20 Feb 2008 09:59:20 +0000 (11:59 +0200)]
KVM: Add API for determining the number of supported memory slots

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Increase vcpu count to 16
Avi Kivity [Wed, 20 Feb 2008 09:56:51 +0000 (11:56 +0200)]
KVM: Increase vcpu count to 16

With NPT support, scalability is much improved.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Add API to retrieve the number of supported vcpus per vm
Avi Kivity [Wed, 20 Feb 2008 09:53:16 +0000 (11:53 +0200)]
KVM: Add API to retrieve the number of supported vcpus per vm

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: make register_address_increment and JMP_REL static inlines
Harvey Harrison [Tue, 19 Feb 2008 15:40:41 +0000 (07:40 -0800)]
KVM: x86 emulator: make register_address_increment and JMP_REL static inlines

Change jmp_rel() to a function as well.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: make register_address, address_mask static inlines
Harvey Harrison [Tue, 19 Feb 2008 15:40:38 +0000 (07:40 -0800)]
KVM: x86 emulator: make register_address, address_mask static inlines

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: add ad_mask static inline
Harvey Harrison [Mon, 18 Feb 2008 19:12:48 +0000 (11:12 -0800)]
KVM: x86 emulator: add ad_mask static inline

Replaces open-coded mask calculation in macros.
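A sketch of what such a helper might look like, assuming the mask is derived from the address size in bytes. The names ad_mask_sketch() and masked_increment(), and the plain ad_bytes parameter, are hypothetical; the emulator's real helper and its argument types are not shown in this log.

```c
#include <assert.h>

/* Mask covering the low ad_bytes bytes of an address register. */
static inline unsigned long ad_mask_sketch(unsigned int ad_bytes)
{
    assert(ad_bytes >= 1 && ad_bytes <= sizeof(unsigned long));
    if (ad_bytes == sizeof(unsigned long))
        return ~0UL; /* avoid shifting by the full word width */
    return (1UL << (ad_bytes * 8)) - 1;
}

/* Increment only the masked part of reg, wrapping at the address size. */
static inline unsigned long masked_increment(unsigned int ad_bytes,
                                             unsigned long reg, long inc)
{
    unsigned long mask = ad_mask_sketch(ad_bytes);

    return (reg & ~mask) | ((reg + (unsigned long)inc) & mask);
}
```

With a 2-byte address size, incrementing a register whose low 16 bits are 0xffff wraps those bits to 0 and leaves the upper bits untouched.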

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agox86: KVM guest: paravirtualized clocksource
Glauber de Oliveira Costa [Fri, 15 Feb 2008 19:52:48 +0000 (17:52 -0200)]
x86: KVM guest: paravirtualized clocksource

This is the guest part of kvm clock implementation
It does not do tsc-only timing, as tsc can have deltas
between cpus, and it did not seem worthy to me to keep
adjusting them.

We do use it, however, for fine-grained adjustment.

Other than that, time comes from the host.

[randy dunlap: add missing include]
[randy dunlap: disallow on Voyager or Visual WS]

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: paravirtualized clocksource: host part
Glauber de Oliveira Costa [Fri, 15 Feb 2008 19:52:47 +0000 (17:52 -0200)]
KVM: paravirtualized clocksource: host part

This is the host part of kvm clocksource implementation. As it does
not include clockevents, it is a fairly simple implementation. We
only have to register a per-vcpu area, and start writing to it periodically.

The area is binary compatible with xen, as we use the same shadow_info
structure.

[marcelo: fix bad_page on MSR_KVM_SYSTEM_TIME]
[avi: save full value of the msr, even if enable bit is clear]
[avi: clear previous value of time_page]

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
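A rough sketch of how a guest might turn the per-vcpu area described above into a time value. The structure layout and field names are assumptions modelled on the Xen vcpu time info ABI that the commit says it is compatible with; the helper names are hypothetical, and the version-based retry loop a real reader needs is left out.

```c
#include <stdint.h>

/* Assumed layout, following the Xen per-vcpu time info ABI. */
struct vcpu_time_info {
	uint32_t version;		/* odd while the host is updating */
	uint32_t pad0;
	uint64_t tsc_timestamp;		/* TSC value at the last update */
	uint64_t system_time;		/* nanoseconds at tsc_timestamp */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;		/* pre-shift applied to the TSC delta */
	uint8_t  pad1[3];
};

/* Scale a TSC delta to nanoseconds: (delta << shift) * mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	/* Split the multiply so the 96-bit product needs no wider type. */
	return ((delta >> 32) * mul) + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Time in nanoseconds for a given TSC reading (hypothetical helper). */
static uint64_t time_ns(const struct vcpu_time_info *ti, uint64_t tsc)
{
	return ti->system_time +
	       scale_delta(tsc - ti->tsc_timestamp,
			   ti->tsc_to_system_mul, ti->tsc_shift);
}
```

The TSC is only used for the delta since the last host update, which matches the "fine-grained adjustment" role described in the guest-side commit.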
16 years agoKVM: SVM: enable LBR virtualization
Joerg Roedel [Wed, 13 Feb 2008 17:58:47 +0000 (18:58 +0100)]
KVM: SVM: enable LBR virtualization

This patch implements the Last Branch Record Virtualization (LBRV) feature of
the AMD Barcelona and Phenom processors into the kvm-amd module. It will only
be enabled if the guest enables last branch recording in the DEBUG_CTL MSR. So
there is no increased world switch overhead when the guest doesn't use these
MSRs.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: allocate the MSR permission map per VCPU
Joerg Roedel [Wed, 13 Feb 2008 17:58:46 +0000 (18:58 +0100)]
KVM: SVM: allocate the MSR permission map per VCPU

This patch changes the kvm-amd module to allocate the SVM MSR permission map
per VCPU instead of a global map for all VCPUs. With this we have more
flexibility allowing specific guests to access virtualized MSRs. This is
required for LBR virtualization.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: let init_vmcb() take struct vcpu_svm as parameter
Joerg Roedel [Wed, 13 Feb 2008 17:58:45 +0000 (18:58 +0100)]
KVM: SVM: let init_vmcb() take struct vcpu_svm as parameter

Change the parameter of the init_vmcb() function in the kvm-amd module from
struct vmcb to struct vcpu_svm.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: fix typo in VMX header define
Ryan Harper [Mon, 11 Feb 2008 16:26:38 +0000 (10:26 -0600)]
KVM: VMX: fix typo in VMX header define

Looking at Intel Volume 3b, page 148, table 20-11, I noticed
that the field name is 'Deliver', not 'Deliever'.  The attached patch changes
the define name and its user in vmx.c.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: add support for Nested Paging
Joerg Roedel [Thu, 7 Feb 2008 12:47:45 +0000 (13:47 +0100)]
KVM: SVM: add support for Nested Paging

This patch contains the SVM architecture dependent changes for KVM to enable
support for the Nested Paging feature of AMD Barcelona and Phenom processors.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: add TDP support to the KVM MMU
Joerg Roedel [Thu, 7 Feb 2008 12:47:44 +0000 (13:47 +0100)]
KVM: MMU: add TDP support to the KVM MMU

This patch contains the changes to the KVM MMU necessary for support of the
Nested Paging feature in AMD Barcelona and Phenom Processors.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: export the load_pdptrs() function to modules
Joerg Roedel [Thu, 7 Feb 2008 12:47:43 +0000 (13:47 +0100)]
KVM: export the load_pdptrs() function to modules

The load_pdptrs() function is required in the SVM module for NPT support.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: make the __nonpaging_map function generic
Joerg Roedel [Thu, 7 Feb 2008 12:47:42 +0000 (13:47 +0100)]
KVM: MMU: make the __nonpaging_map function generic

The mapping function for the nonpaging case in the softmmu does basically the
same as required for Nested Paging. Make this function generic so it can be
used for both.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: export information about NPT to generic x86 code
Joerg Roedel [Thu, 7 Feb 2008 12:47:41 +0000 (13:47 +0100)]
KVM: export information about NPT to generic x86 code

The generic x86 code has to know if the specific implementation uses Nested
Paging. In the generic code Nested Paging is called Two Dimensional Paging
(TDP) to avoid confusion with (future) TDP implementations of other vendors.
This patch exports the availability of TDP to the generic x86 code.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: add module parameter to disable Nested Paging
Joerg Roedel [Thu, 7 Feb 2008 12:47:40 +0000 (13:47 +0100)]
KVM: SVM: add module parameter to disable Nested Paging

To disable the use of the Nested Paging feature even if it is available in
hardware this patch adds a module parameter. Nested Paging can be disabled by
passing npt=0 to the kvm_amd module.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: add detection of Nested Paging feature
Joerg Roedel [Thu, 7 Feb 2008 12:47:39 +0000 (13:47 +0100)]
KVM: SVM: add detection of Nested Paging feature

Let SVM detect if the Nested Paging feature is available on the hardware.
Disable it to keep this patch series bisectable.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: SVM: move feature detection to hardware setup code
Joerg Roedel [Thu, 7 Feb 2008 12:47:38 +0000 (13:47 +0100)]
KVM: SVM: move feature detection to hardware setup code

By moving the SVM feature detection from the each_cpu code to the hardware
setup code it runs only once. As an additional advantage the feature check is now
available earlier in the module setup process.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: allow access to EFER in 32bit KVM
Joerg Roedel [Thu, 31 Jan 2008 13:57:40 +0000 (14:57 +0100)]
KVM: allow access to EFER in 32bit KVM

This patch makes the EFER register accessible on a 32bit KVM host. This is
necessary to boot 32 bit PAE guests under SVM.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: unifdef the EFER specific code
Joerg Roedel [Thu, 31 Jan 2008 13:57:39 +0000 (14:57 +0100)]
KVM: VMX: unifdef the EFER specific code

To allow access to the EFER register in 32bit KVM the EFER specific code has to
be exported to the x86 generic code. This patch does this in a backwards
compatible manner.

[avi: add check for EFER-less hosts]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: align valid EFER bits with the features of the host system
Joerg Roedel [Thu, 31 Jan 2008 13:57:38 +0000 (14:57 +0100)]
KVM: align valid EFER bits with the features of the host system

This patch aligns the bits the guest can set in the EFER register with the
features in the host processor. Currently it leaves EFER.NX disabled if the
processor does not support it and enables EFER.LME and EFER.LMA only for KVM on
64 bit hosts.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
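A small sketch of the policy described in the EFER commit above: the set of bits a guest may set depends on what the host offers. The bit positions are the architectural ones; the helper names, and the choice to always allow EFER.SCE, are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions. */
#define EFER_SCE (UINT64_C(1) << 0)	/* SYSCALL enable */
#define EFER_LME (UINT64_C(1) << 8)	/* long mode enable */
#define EFER_LMA (UINT64_C(1) << 10)	/* long mode active */
#define EFER_NX  (UINT64_C(1) << 11)	/* no-execute enable */

/* Hypothetical helper: bits the guest may set, given host capabilities. */
static uint64_t efer_allowed_bits(bool host_has_nx, bool host_is_64bit)
{
	uint64_t allowed = EFER_SCE;	/* assumed to be always permitted */

	if (host_has_nx)
		allowed |= EFER_NX;
	if (host_is_64bit)
		allowed |= EFER_LME | EFER_LMA;
	return allowed;
}

/* A guest write is acceptable only if it touches no reserved bit. */
static bool efer_write_ok(uint64_t value, uint64_t allowed)
{
	return (value & ~allowed) == 0;
}
```

On a 32-bit host with NX, a guest write of EFER.NX passes while EFER.LME is rejected.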
16 years agoKVM: make EFER_RESERVED_BITS configurable for architecture code
Joerg Roedel [Thu, 31 Jan 2008 13:57:37 +0000 (14:57 +0100)]
KVM: make EFER_RESERVED_BITS configurable for architecture code

This patch gives the SVM and VMX implementations the ability to add some bits
the guest can set in its EFER register.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Disable pagefaults during copy_from_user_inatomic()
Andrea Arcangeli [Wed, 30 Jan 2008 18:57:35 +0000 (19:57 +0100)]
KVM: Disable pagefaults during copy_from_user_inatomic()

With CONFIG_PREEMPT=n, this is needed in order to prevent the fault-in
code from sleeping.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Use CONFIG_PREEMPT_NOTIFIERS around struct preempt_notifier
Hollis Blanchard [Mon, 28 Jan 2008 23:42:34 +0000 (17:42 -0600)]
KVM: Use CONFIG_PREEMPT_NOTIFIERS around struct preempt_notifier

This allows kvm_host.h to be #included even when struct preempt_notifier is
undefined. This is needed to build ppc asm-offsets.h.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: VMX: Enable Virtual Processor Identification (VPID)
Sheng Yang [Thu, 17 Jan 2008 07:14:33 +0000 (15:14 +0800)]
KVM: VMX: Enable Virtual Processor Identification (VPID)

To allow TLB entries to be retained across VM entry and VM exit, the VMM
can now identify distinct address spaces through a new virtual-processor ID
(VPID) field of the VMCS.

[avi: drop vpid_sync_all()]
[avi: add "cc" to asm constraints]

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Limit vcpu mmap size to one page on non-x86
Avi Kivity [Thu, 24 Jan 2008 13:13:08 +0000 (15:13 +0200)]
KVM: Limit vcpu mmap size to one page on non-x86

The second page is only needed on archs that support pio.

Noted by Carsten Otte.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Decouple mmio from shadow page tables
Avi Kivity [Thu, 24 Jan 2008 09:44:11 +0000 (11:44 +0200)]
KVM: MMU: Decouple mmio from shadow page tables

Currently an mmio guest pte is encoded in the shadow pagetable as a
not-present trapping pte, with the SHADOW_IO_MARK bit set.  However
nothing is ever done with this information, so maintaining it is a
useless complication.

This patch moves the check for mmio to before shadow ptes are instantiated,
so the shadow code is never invoked for ptes that reference mmio.  The code
is simpler, and with future work, can be made to handle mmio concurrently.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: group decoding for group 1 instructions
Avi Kivity [Wed, 23 Jan 2008 20:26:09 +0000 (22:26 +0200)]
KVM: x86 emulator: group decoding for group 1 instructions

Opcodes 0x80-0x83

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: Only x86 has pio
Avi Kivity [Wed, 23 Jan 2008 16:14:23 +0000 (18:14 +0200)]
KVM: Only x86 has pio

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: constify function pointer tables
Jan Engelhardt [Tue, 22 Jan 2008 19:46:14 +0000 (20:46 +0100)]
KVM: constify function pointer tables

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: add group 7 decoding
Avi Kivity [Fri, 18 Jan 2008 11:36:50 +0000 (13:36 +0200)]
KVM: x86 emulator: add group 7 decoding

This adds group decoding for opcode 0x0f 0x01 (group 7).

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Group decoding for groups 4 and 5
Avi Kivity [Fri, 18 Jan 2008 11:12:26 +0000 (13:12 +0200)]
KVM: x86 emulator: Group decoding for groups 4 and 5

Add group decoding support for opcode 0xfe (group 4) and 0xff (group 5).

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: Group decoding for group 3
Avi Kivity [Fri, 18 Jan 2008 10:58:04 +0000 (12:58 +0200)]
KVM: x86 emulator: Group decoding for group 3

This adds group decoding support for opcodes 0xf6, 0xf7 (group 3).

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: group decoding for group 1A
Avi Kivity [Fri, 18 Jan 2008 10:46:50 +0000 (12:46 +0200)]
KVM: x86 emulator: group decoding for group 1A

This adds group decode support for opcode 0x8f.

Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: x86 emulator: add support for group decoding
Avi Kivity [Fri, 18 Jan 2008 10:38:59 +0000 (12:38 +0200)]
KVM: x86 emulator: add support for group decoding

Certain x86 instructions use bits 3:5 of the byte following the opcode as an
opcode extension, with the decode sometimes depending on bits 6:7 as well.
Add support for this in the main decoding table rather than an ad hoc
adaptation per opcode.

Signed-off-by: Avi Kivity <avi@qumranet.com>
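To make the bit positions in the group decoding commit above concrete, here is a minimal sketch that extracts bits 3:5 and 6:7 of the byte following the opcode and uses the former to index a per-group table. The table and function names are hypothetical and are not taken from the emulator; the group 3 mnemonics follow the usual x86 opcode map.

```c
#include <stddef.h>
#include <stdint.h>

/* Bits 6:7 of the byte following the opcode. */
static inline unsigned int modrm_mod(uint8_t modrm)
{
	return (modrm >> 6) & 3;
}

/* Bits 3:5 of the byte following the opcode: the opcode extension. */
static inline unsigned int modrm_reg(uint8_t modrm)
{
	return (modrm >> 3) & 7;
}

/* Hypothetical table for group 3 (opcodes 0xf6, 0xf7). */
static const char *const group3_names[8] = {
	"test", NULL /* left undecoded in this sketch */, "not", "neg",
	"mul", "imul", "div", "idiv",
};

/* Returns the mnemonic selected by the opcode extension, or NULL. */
static const char *decode_group3(uint8_t modrm)
{
	return group3_names[modrm_reg(modrm)];
}
```

For the byte sequence f7 d8, the second byte has bits 3:5 equal to 3, which selects "neg".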
16 years agoKVM: MMU: Simplify hash table indexing
Dong, Eddie [Mon, 7 Jan 2008 11:20:25 +0000 (13:20 +0200)]
KVM: MMU: Simplify hash table indexing

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
16 years agoKVM: MMU: Update shadow ptes on partial guest pte writes
Dong, Eddie [Mon, 7 Jan 2008 09:14:20 +0000 (11:14 +0200)]
KVM: MMU: Update shadow ptes on partial guest pte writes

A partial guest pte write will leave shadow_trap_nonpresent_pte
in spte, which generates a vmexit at the next guest access through that pte.

This patch improves this by reading the full guest pte in advance and thus
being able to update the spte and eliminate the vmexit.

This helps pae guests which use two 32-bit writes to set a single 64-bit pte.

[truncation fix by Eric]

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
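The pae case in the commit above can be illustrated with a small sketch: a 64-bit pte set by two 32-bit writes is only complete once both halves are combined. This helper is hypothetical and merges the written bytes into the previous value; the patch itself is described as reading the full guest pte, so this only illustrates why the full 64-bit value is needed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a write of `bytes` bytes at byte `offset`
 * within an 8-byte pte with the pte's previous value.
 */
static uint64_t merge_partial_pte_write(uint64_t old_gpte, unsigned int offset,
					uint64_t val, unsigned int bytes)
{
	uint64_t mask;

	assert(bytes >= 1 && bytes <= 8 && offset + bytes <= 8);
	mask = (bytes == 8) ? UINT64_MAX
			    : ((UINT64_C(1) << (bytes * 8)) - 1);
	mask <<= offset * 8;
	return (old_gpte & ~mask) | ((val << (offset * 8)) & mask);
}
```

A pae guest writing the low half and then the high half produces the final pte only after the second call.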
16 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux...
Linus Torvalds [Sat, 26 Apr 2008 21:04:32 +0000 (14:04 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3:
  x86_64/mm: check and print vmemmap allocation continuous
  x86_64: fix setup_node_bootmem to support big mem excluding with memmap
  x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
  mm: allow reserve_bootmem() cross nodes
  mm: offset align in alloc_bootmem()
  mm: fix alloc_bootmem_core to use fast searching for all nodes
  mm: make mem_map allocation continuous

16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
Linus Torvalds [Sat, 26 Apr 2008 21:03:54 +0000 (14:03 -0700)]
Merge git://git./linux/kernel/git/sam/kbuild

* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild:
  kbuild: scripts/Makefile.modpost typo fix
  kbuild: soften MODULE_LICENSE check

16 years agox86_64/mm: check and print vmemmap allocation continuous
Yinghai Lu [Sat, 12 Apr 2008 08:19:24 +0000 (01:19 -0700)]
x86_64/mm: check and print vmemmap allocation continuous

On big systems with lots of memory, don't print out too much during
bootup, and make it easy to see whether the allocation is continuous.

A 256G, 8-socket system will get
 [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
[ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
 [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
[ffffe20038700000-ffffe200387fffff] potential offnode page_structs
 [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
 [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
 [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
[ffffe20054700000-ffffe200547fffff] potential offnode page_structs
 [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
[ffffe20070700000-ffffe200707fffff] potential offnode page_structs
 [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
 [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
 [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
[ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
 [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
[ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
 [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
 [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
 [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
[ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
 [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
 [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7

instead of a very long print out...

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
16 years agox86_64: fix setup_node_bootmem to support big mem excluding with memmap
Yinghai Lu [Tue, 18 Mar 2008 19:52:37 +0000 (12:52 -0700)]
x86_64: fix setup_node_bootmem to support big mem excluding with memmap

Typical case: a four-socket system where every node has 4g of ram, and we are using:

memmap=10g$4g

to mask out memory on node1 and node2

When numa is enabled, early_node_mem is used to get node_data and node_bootmap.

If it cannot get memory from the same node with find_e820_area(), it will
use alloc_bootmem to get a buffer from previous nodes.

So check for that and print out some info about it.

We need to move early_res_to_bootmem into every setup_node_bootmem, and have
it take the range that the node has. Otherwise alloc_bootmem could return an
address that was reserved early.

depends on "mm: make reserve_bootmem can crossed the nodes".

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agox86_64: make reserve_bootmem_generic() use new reserve_bootmem()
Yinghai Lu [Tue, 18 Mar 2008 19:50:21 +0000 (12:50 -0700)]
x86_64: make reserve_bootmem_generic() use new reserve_bootmem()

"mm: make reserve_bootmem can crossed the nodes" provides new
reserve_bootmem(), let reserve_bootmem_generic() use that.

reserve_bootmem_generic() is used to reserve initramdisk, so this way
we can make sure even when bootloader or kexec load ranges cross the
node memory boundaries, reserve_bootmem still works.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agomm: allow reserve_bootmem() cross nodes
Yinghai Lu [Tue, 18 Mar 2008 19:49:12 +0000 (12:49 -0700)]
mm: allow reserve_bootmem() cross nodes

Split reserve_bootmem_core() into two functions, one which checks
for conflicts, and one which sets the bits.

Also make reserve_bootmem() loop over bdata_list so that a range can cross nodes.

Users could be crashkernel and ramdisk, in case the range provided
by those external sources crosses nodes.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
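The check-then-set split described in the commit above can be sketched with a toy model: every node is first checked for conflicts, and only if all pass are the bits set, so a failed request leaves no partial reservation. The types, sizes and function names are hypothetical and much simpler than the real bootmem data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define NODE_PAGES 8	/* toy node size, in pages */

struct node {
	unsigned long start;		/* first page number of the node */
	bool reserved[NODE_PAGES];	/* one flag per page */
};

/* Clamp [*start, *end) to the node; returns false if there is no overlap. */
static bool clamp_to_node(const struct node *n,
			  unsigned long *start, unsigned long *end)
{
	unsigned long ns = n->start, ne = n->start + NODE_PAGES;

	if (*end <= ns || *start >= ne)
		return false;
	if (*start < ns)
		*start = ns;
	if (*end > ne)
		*end = ne;
	return true;
}

/* Phase 1: report whether the part of the range on this node is free. */
static bool can_reserve(const struct node *n,
			unsigned long start, unsigned long end)
{
	if (!clamp_to_node(n, &start, &end))
		return true;
	for (unsigned long p = start; p < end; p++)
		if (n->reserved[p - n->start])
			return false;
	return true;
}

/* Phase 2: set the bits for the part of the range on this node. */
static void do_reserve(struct node *n, unsigned long start, unsigned long end)
{
	if (!clamp_to_node(n, &start, &end))
		return;
	for (unsigned long p = start; p < end; p++)
		n->reserved[p - n->start] = true;
}

/* Reserve [start, end) across all nodes; returns 0 or -1 on conflict. */
static int reserve_range(struct node *nodes, size_t nr,
			 unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < nr; i++)
		if (!can_reserve(&nodes[i], start, end))
			return -1;
	for (size_t i = 0; i < nr; i++)
		do_reserve(&nodes[i], start, end);
	return 0;
}
```

Running all checks before any bit is set is what makes a range that spans two nodes either fully reserved or not reserved at all.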
16 years agomm: offset align in alloc_bootmem()
Yinghai Lu [Tue, 18 Mar 2008 19:44:48 +0000 (12:44 -0700)]
mm: offset align in alloc_bootmem()

Offset alignment is needed when node_boot_start's alignment is less than
the alignment required.

Use a local node_boot_start to match the alignment, so that no extra operation
is added to the search loop.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
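A small sketch of the problem described in the commit above: if the node's start address is less aligned than the request, aligning only the offset does not give an aligned address. The helper below is hypothetical and aligns the absolute address before converting back to an offset; the patch itself is described as adjusting a local node_boot_start instead.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Smallest offset >= min_offset such that base + offset is aligned. */
static uint64_t aligned_offset(uint64_t base, uint64_t min_offset,
			       uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	return ALIGN_UP(base + min_offset, align) - base;
}
```

With a base of 0x1000 and a 2M alignment, offset 0 is itself "aligned" but the address 0x1000 is not; the helper returns 0x1ff000 so that the resulting address is 0x200000.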
16 years agomm: fix alloc_bootmem_core to use fast searching for all nodes
Yinghai Lu [Tue, 11 Mar 2008 06:23:42 +0000 (23:23 -0700)]
mm: fix alloc_bootmem_core to use fast searching for all nodes

Make the nodes other than node 0 use bdata->last_success for fast
search too.

We need to use __alloc_bootmem_core() for vmemmap allocation for other
nodes when numa and sparsemem/vmemmap are enabled.

Also, make fail_block path increase i with incr only after ALIGN
to avoid extra increase when size is larger than align.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agomm: make mem_map allocation continuous
Yinghai Lu [Sun, 13 Apr 2008 18:51:06 +0000 (11:51 -0700)]
mm: make mem_map allocation continuous

vmemmap allocation currently has this layout:

 [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
...

Note that there is a 2M hole between them, which is not optimal.

The root cause is that a usemap (24 bytes) is allocated after every 2M
mem_map, which pushes the next vmemmap (2M) to the next 2M alignment.

Solution: try to allocate the mem_map continuously.

after the patch, we get:

 [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
...

which is the ideal layout.

and the usemaps will share a page because they are allocated continuously too:

sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
...

so we make the bootmem allocation more compact and use less memory
for usemap => mission accomplished ;-)

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
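The arithmetic behind the 2M holes in the commit above can be reproduced with a toy bump allocator. This is a sketch under assumptions: the allocator and its names are hypothetical, and the 128-byte usemap alignment is inferred from the 0x80 spacing in the log output rather than stated in the commit.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

struct bump {
	uint64_t next;	/* next free address */
};

/* Allocate size bytes at the next address aligned to align. */
static uint64_t bump_alloc(struct bump *b, uint64_t size, uint64_t align)
{
	uint64_t addr;

	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	addr = ALIGN_UP(b->next, align);
	b->next = addr + size;
	return addr;
}
```

Starting at 0x1400000, allocating a 2M mem_map, then a 24-byte usemap, then another 2M mem_map places the second mem_map at 0x1800000, leaving a 2M hole. Allocating the two mem_maps back to back places the second at 0x1600000, and the usemaps that follow land 0x80 apart.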
16 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux...
Linus Torvalds [Sat, 26 Apr 2008 20:46:11 +0000 (13:46 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/x86/linux-2.6-generic-bitops-v3

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-generic-bitops-v3:
  x86, bitops: select the generic bitmap search functions
  x86: include/asm-x86/pgalloc.h/bitops.h: checkpatch cleanups - formatting only
  x86: finalize bitops unification
  x86, UML: remove x86-specific implementations of find_first_bit
  x86: optimize find_first_bit for small bitmaps
  x86: switch 64-bit to generic find_first_bit
  x86: generic versions of find_first_(zero_)bit, convert i386
  bitops: use __fls for fls64 on 64-bit archs
  generic: implement __fls on all 64-bit archs
  generic: introduce a generic __fls implementation
  x86: merge the simple bitops and move them to bitops.h
  x86, generic: optimize find_next_(zero_)bit for small constant-size bitmaps
  x86, uml: fix uml with generic find_next_bit for x86
  x86: change x86 to use generic find_next_bit
  uml: Kconfig cleanup
  uml: fix build error

16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
Linus Torvalds [Sat, 26 Apr 2008 20:44:19 +0000 (13:44 -0700)]
Merge git://git./linux/kernel/git/bart/ide-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (46 commits)
  ide: constify struct ide_dma_ops
  ide: add struct ide_dma_ops (take 3)
  ide: add IDE_HFLAG_SERIALIZE_DMA host flag
  sl82c105: check bridge revision in sl82c105_init_one()
  au1xxx-ide: use ->init_dma method
  palm_bk3710: use ->init_dma method
  sgiioc4: use ->init_dma method
  icside: use ->init_dma method
  ide-pmac: use ->init_dma method
  ide: do complete DMA setup in ->init_dma method (take 2)
  au1xxx-ide: fix MWDMA support
  ide: cleanup ide_setup_dma()
  ide: factor out setting PCI bus-mastering from ide_hwif_setup_dma()
  ide: export ide_allocate_dma_engine()
  ide: move ide_setup_dma() call out from ->init_dma method
  alim15x3: skip DMA initialization completely on revs < 0x20
  pdc202xx_old: remove init_dma_pdc202xx()
  ide: don't display "BIOS" settings in ide_setup_dma()
  ide: remove ->cds field from ide_hwif_t (take 2)
  ide: remove ide_dma_iobase()
  ...

16 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux...
Linus Torvalds [Sat, 26 Apr 2008 20:29:41 +0000 (13:29 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/x86/linux-2.6-x86-bigbox-bootparam

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootparam:
  x86, boot: Document for linked list of struct setup_data
  x86, boot: export linked list of struct setup_data via debugfs
  x86, boot: add linked list of struct setup_data
  x86, boot: add free_early to early reservation machanism

16 years agoide: constify struct ide_dma_ops
Bartlomiej Zolnierkiewicz [Sat, 26 Apr 2008 20:25:24 +0000 (22:25 +0200)]
ide: constify struct ide_dma_ops

* Export ide_dma_exec_cmd() and __ide_dma_test_irq().

* Constify struct ide_dma_ops.

* Always set hwif->dma_ops to &sff_dma_ops in ide_setup_dma()
  (it is later overridden by ide_init_port() if needed) and drop
  'const struct ide_port_info *d' argument.

While at it:

* Rename __ide_dma_test_irq() to ide_dma_test_irq().

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>