Gleb Natapov [Thu, 20 Dec 2012 14:57:42 +0000 (16:57 +0200)]
KVM: emulator: drop RPL check from linearize() function
According to Intel SDM Vol3 Section 5.5 "Privilege Levels" and 5.6
"Privilege Level Checking When Accessing Data Segments" RPL checking is
done during loading of a segment selector, not during data access. We
already do checking during segment selector loading, so drop the check
during data access. Checking RPL during data access triggers #GP if
after transition from real mode to protected mode RPL bits in a segment
selector are set.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Jesse Larrew [Mon, 10 Dec 2012 21:31:51 +0000 (15:31 -0600)]
x86: kvm_para: fix typo in hypercall comments
Correct a typo in the comment explaining hypercalls.
Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gleb Natapov [Mon, 24 Dec 2012 15:49:30 +0000 (17:49 +0200)]
KVM: move the code that installs new slots array to a separate function.
Move repetitive code sequence to a separate function.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:55 +0000 (19:10 +0200)]
KVM: VMX: remove unneeded temporary variable from vmx_set_segment()
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:54 +0000 (19:10 +0200)]
KVM: VMX: clean-up vmx_set_segment()
Move all vm86_active logic into one place.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:53 +0000 (19:10 +0200)]
KVM: VMX: remove redundant code from vmx_set_segment()
Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need
to do it twice.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:52 +0000 (19:10 +0200)]
KVM: VMX: use fix_rmode_seg() to fix all code/data segments
The code for SS and CS does the same thing fix_rmode_seg() is doing.
Use it instead of hand crafted code.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:51 +0000 (19:10 +0200)]
KVM: VMX: return correct segment limit and flags for CS/SS registers in real mode
VMX without unrestricted mode cannot virtualize real mode, so if
emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate
it. Sometimes, when guest moves from protected mode to real mode, it
leaves segment descriptors in a state not suitable for use by vm86 mode
virtualization, so we keep shadow copy of segment descriptors for internal
use and load fake register to VMCS for guest entry to succeed. Till
now we kept shadow for all segments except SS and CS (for SS and CS we
returned parameters directly from VMCS), but since commit
a5625189f6810
emulator enforces segment limits in real mode. This causes #GP during move
from protected mode to real mode when emulator fetches first instruction
after moving to real mode since it uses incorrect CS base and limit to
linearize the %rip. Fix by keeping shadow for SS and CS too.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:50 +0000 (19:10 +0200)]
KVM: VMX: relax check for CS register in rmode_segment_valid()
rmode_segment_valid() checks if segment descriptor can be used to enter
vm86 mode. VMX spec mandates that in vm86 mode CS register will be of
type data, not code. Lets allow guest entry with vm86 mode if the only
problem with CS register is incorrect type. Otherwise entire real mode
will be emulated.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Wed, 12 Dec 2012 17:10:49 +0000 (19:10 +0200)]
KVM: VMX: cleanup rmode_segment_valid()
Set segment fields explicitly instead of using binary operations.
No behaviour changes.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Alex Williamson [Fri, 21 Dec 2012 15:20:16 +0000 (08:20 -0700)]
kvm: Fix memory slot generation updates
Previous patch "kvm: Minor memory slot optimization" (
b7f69c555ca43)
overlooked the generation field of the memory slots. Re-using the
original memory slots left us with with two slightly different memory
slots with the same generation. To fix this, make update_memslots()
take a new parameter to specify the last generation. This also makes
generation management more explicit to avoid such problems in the future.
Reported-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Yang Zhang [Wed, 12 Dec 2012 05:05:12 +0000 (13:05 +0800)]
KVM: remove a wrong hack of delivery PIT intr to vcpu0
This hack is wrong. The pin number of PIT is connected to
2 not 0. This means this hack never takes effect. So it is ok
to remove it.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cornelia Huck [Fri, 14 Dec 2012 16:02:18 +0000 (17:02 +0100)]
KVM: s390: Add a channel I/O based virtio transport driver.
Add a driver for kvm guests that matches virtual ccw devices provided
by the host as virtio bridge devices.
These virtio-ccw devices use a special set of channel commands in order
to perform virtio functions.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cornelia Huck [Fri, 14 Dec 2012 16:02:17 +0000 (17:02 +0100)]
s390/ccwdev: Include asm/schid.h.
Get the definition of struct subchannel_id.
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cornelia Huck [Fri, 14 Dec 2012 16:02:16 +0000 (17:02 +0100)]
KVM: s390: Handle hosts not supporting s390-virtio.
Running under a kvm host does not necessarily imply the presence of
a page mapped above the main memory with the virtio information;
however, the code includes a hard coded access to that page.
Instead, check for the presence of the page and exit gracefully
before we hit an addressing exception if it does not exist.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
cc: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Nickolai Zeldovich [Sat, 15 Dec 2012 11:34:37 +0000 (06:34 -0500)]
kvm: fix i8254 counter 0 wraparound
The kvm i8254 emulation for counter 0 (but not for counters 1 and 2)
has at least two bugs in mode 0:
1. The OUT bit, computed by pit_get_out(), is never set high.
2. The counter value, computed by pit_get_count(), wraps back around to
the initial counter value, rather than wrapping back to 0xFFFF
(which is the behavior described in the comment in __kpit_elapsed,
the behavior implemented by qemu, and the behavior observed on AMD
hardware).
The bug stems from __kpit_elapsed computing the elapsed time mod the
initial counter value (stored as nanoseconds in ps->period). This is both
unnecessary (none of the callers of kpit_elapsed expect the value to be
at most the initial counter value) and incorrect (it causes pit_get_count
to appear to wrap around to the initial counter value rather than 0xFFFF).
Removing this mod from __kpit_elapsed fixes both of the above bugs.
Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Gleb Natapov [Fri, 14 Dec 2012 13:23:16 +0000 (15:23 +0200)]
KVM: remove unused variable.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:38 +0000 (10:33 -0700)]
KVM: Increase user memory slots on x86 to 125
With the 3 private slots, this gives us a nice round 128 slots total.
The primary motivation for this is to support more assigned devices.
Each assigned device can theoretically use up to 8 slots (6 MMIO BARs,
1 ROM BAR, 1 spare for a split MSI-X table mapping) though it's far
more typical for a device to use 3-4 slots. If we assume a typical VM
uses a dozen slots for non-assigned devices purposes, we should always
be able to support 14 worst case assigned devices or 28 to 37 typical
devices.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:32 +0000 (10:33 -0700)]
KVM: struct kvm_memory_slot.id -> short
We're currently offering a whopping 32 memory slots to user space, an
int is a bit excessive for storing this. We would like to increase
our memslots, but SHRT_MAX should be more than enough.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:26 +0000 (10:33 -0700)]
KVM: struct kvm_memory_slot.flags -> u32
struct kvm_userspace_memory_region.flags is a u32 with a comment that
bits 0 ~ 15 are visible to userspace and the other bits are reserved
for kvm internal use. KVM_MEMSLOT_INVALID is the only internal use
flag and it has a comment that bits 16 ~ 31 are internally used and
the other bits are visible to userspace.
Therefore, let's define this as a u32 so we don't waste bytes on LP64
systems. Move to the end of the struct for alignment.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:21 +0000 (10:33 -0700)]
KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:15 +0000 (10:33 -0700)]
KVM: Make KVM_PRIVATE_MEM_SLOTS optional
Seems like everyone copied x86 and defined 4 private memory slots
that never actually get used. Even x86 only uses 3 of the 4. These
aren't exposed so there's no need to add padding.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:09 +0000 (10:33 -0700)]
KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is
the user accessible slots and the other is user + private. Make this
more obvious.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:33:03 +0000 (10:33 -0700)]
KVM: Minor memory slot optimization
If a slot is removed or moved in the guest physical address space, we
first allocate and install a new slot array with the invalidated
entry. The old array is then freed. We then proceed to allocate yet
another slot array to install the permanent replacement. Re-use the
original array when this occurs and avoid the extra kfree/kmalloc.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:32:57 +0000 (10:32 -0700)]
KVM: Fix iommu map/unmap to handle memory slot moves
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case. We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array. Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:32:51 +0000 (10:32 -0700)]
KVM: Check userspace_addr when modifying a memory slot
The API documents that only flags and guest physical memory space can
be modified on an existing slot, but we don't enforce that the
userspace address cannot be modified. Instead we just ignore it.
This means that a user may think they've successfully moved both the
guest and user addresses, when in fact only the guest address changed.
Check and error instead.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Mon, 10 Dec 2012 17:32:45 +0000 (10:32 -0700)]
KVM: Restrict non-existing slot state transitions
The API documentation states:
When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified.
An "existing slot" requires a non-zero npages (memory_size). The only
transition we should therefore allow for a non-existing slot should be
to create the slot, which includes setting a non-zero memory_size. We
currently allow calls to modify non-existing slots, which is pointless,
confusing, and possibly wrong.
With this we know that the invalidation path of __kvm_set_memory_region
is always for a delete or move and never for adding a zero size slot.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gleb Natapov [Mon, 10 Dec 2012 12:05:55 +0000 (14:05 +0200)]
KVM: inject ExtINT interrupt before APIC interrupts
According to Intel SDM Volume 3 Section 10.8.1 "Interrupt Handling with
the Pentium 4 and Intel Xeon Processors" and Section 10.8.2 "Interrupt
Handling with the P6 Family and Pentium Processors" ExtINT interrupts are
sent directly to the processor core for handling. Currently KVM checks
APIC before it considers ExtINT interrupts for injection which is
backwards from the spec. Make code behave according to the SDM.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Nadav Amit [Thu, 6 Dec 2012 23:55:10 +0000 (21:55 -0200)]
KVM: x86: fix mov immediate emulation for 64-bit operands
MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand.
The previous emulation implementation assumes the operand is no longer than 32.
Adding OpImm64 for this matter.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579
Signed-off-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gleb Natapov [Mon, 10 Dec 2012 09:42:30 +0000 (11:42 +0200)]
KVM: emulator: implement AAD instruction
Windows2000 uses it during boot. This fixes
https://bugzilla.kernel.org/show_bug.cgi?id=50921
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Linus Torvalds [Thu, 13 Dec 2012 23:31:08 +0000 (15:31 -0800)]
Merge tag 'kvm-3.8-1' of git://git./virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
"Considerable KVM/PPC work, x86 kvmclock vsyscall support,
IA32_TSC_ADJUST MSR emulation, amongst others."
Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.
* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
KVM: emulator: fix real mode segment checks in address linearization
VMX: remove unneeded enable_unrestricted_guest check
KVM: VMX: fix DPL during entry to protected mode
x86/kexec: crash_vmclear_local_vmcss needs __rcu
kvm: Fix irqfd resampler list walk
KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
KVM: MMU: optimize for set_spte
KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
KVM: PPC: bookehv: Add guest computation mode for irq delivery
KVM: PPC: Make EPCR a valid field for booke64 and bookehv
KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
KVM: PPC: e500: Add emulation helper for getting instruction ea
KVM: PPC: bookehv64: Add support for interrupt handling
KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
KVM: PPC: booke: Fix get_tb() compile error on 64-bit
KVM: PPC: e500: Silence bogus GCC warning in tlb code
...
Linus Torvalds [Thu, 13 Dec 2012 22:29:16 +0000 (14:29 -0800)]
Merge tag 'stable/for-linus-3.8-rc0-tag' of git://git./linux/kernel/git/konrad/xen
Pull Xen updates from Konrad Rzeszutek Wilk:
- Add necessary infrastructure to make balloon driver work under ARM.
- Add /dev/xen/privcmd interfaces to work with ARM and PVH.
- Improve Xen PCIBack wild-card parsing.
- Add Xen ACPI PAD (Processor Aggregator) support - so can offline/
online sockets depending on the power consumption.
- PVHVM + kexec = use an E820_RESV region for the shared region so we
don't overwrite said region during kexec reboot.
- Cleanups, compile fixes.
Fix up some trivial conflicts due to the balloon driver now working on
ARM, and there were changes next to the previous work-arounds that are
now gone.
* tag 'stable/for-linus-3.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/PVonHVM: fix compile warning in init_hvm_pv_info
xen: arm: implement remap interfaces needed for privcmd mappings.
xen: correctly use xen_pfn_t in remap_domain_mfn_range.
xen: arm: enable balloon driver
xen: balloon: allow PVMMU interfaces to be compiled out
xen: privcmd: support autotranslated physmap guests.
xen: add pages parameter to xen_remap_domain_mfn_range
xen/acpi: Move the xen_running_on_version_or_later function.
xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
xen/acpi: Fix compile error by missing decleration for xen_domain.
xen/acpi: revert pad config check in xen_check_mwait
xen/acpi: ACPI PAD driver
xen-pciback: reject out of range inputs
xen-pciback: simplify and tighten parsing of device IDs
xen PVonHVM: use E820_Reserved area for shared_info
Linus Torvalds [Thu, 13 Dec 2012 22:20:19 +0000 (14:20 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 update from Martin Schwidefsky:
"Add support to generate code for the latest machine zEC12, MOD and XOR
instruction support for the BPF jit compiler, the dasd safe offline
feature and the big one: the s390 architecture gets PCI support!!
Right before the world ends on the 21st ;-)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
s390/qdio: rename the misleading PCI flag of qdio devices
s390/pci: remove obsolete email addresses
s390/pci: speed up __iowrite64_copy by using pci store block insn
s390/pci: enable NEED_DMA_MAP_STATE
s390/pci: no msleep in potential IRQ context
s390/pci: fix potential NULL pointer dereference in dma_free_seg_table()
s390/pci: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
s390/bpf,jit: add support for XOR instruction
s390/bpf,jit: add support MOD instruction
s390/cio: fix pgid reserved check
vga: compile fix, disable vga for s390
s390/pci: add PCI Kconfig options
s390/pci: s390 specific PCI sysfs attributes
s390/pci: PCI hotplug support via SCLP
s390/pci: CHSC PCI support for error and availability events
s390/pci: DMA support
s390/pci: PCI adapter interrupts for MSI/MSI-X
s390/bitops: find leftmost bit instruction support
s390/pci: CLP interface
s390/pci: base support
...
Linus Torvalds [Thu, 13 Dec 2012 21:23:33 +0000 (13:23 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven.
Fix up trivial conflict (m68k switched to generic version of
uapi/asm/socket.h, net tree updated the old one) as per Geert.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/sun3: Fix instruction faults
m68k/sun3: Get interrupts working again
m68k: move to a single instance of free_initmem()
m68k: merge MMU and non-MMU versions of mm/init.c
m68k: switch to using the asm-generic termios.h
m68k: switch to using the asm-generic termbits.h
m68k: switch to using the asm-generic sockios.h
m68k: switch to using the asm-generic socket.h
m68k: switch to using the asm-generic shmbuf.h
m68k: switch to using the asm-generic sembuf.h
m68k: switch to using the asm-generic msgbuf.h
m68k: switch to using the asm-generic auxvec.h
m68k: switch to using the asm-generic shmparam.h
m68k: switch to using the asm-generic spinlock.h
m68k: switch to using the asm-generic hw_irq.h
arch/m68k: remove CONFIG_EXPERIMENTAL
Linus Torvalds [Thu, 13 Dec 2012 21:21:19 +0000 (13:21 -0800)]
Merge git://git./linux/kernel/git/davem/sparc-next
Pull tiny sparc update from David Miller:
"Not much going on this release cycle in sparc land, just a Kconfig
tweak."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
of_i2c: sparc: Allow OF_I2C for sparc
Linus Torvalds [Thu, 13 Dec 2012 21:20:02 +0000 (13:20 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"A pile of fixes in response to yesterday's big merge. The SCTP HMAC
thing hasn't been addressed yet, I'll take care of that myself if Neil
and Vlad don't show signs of life by tomorrow.
1) Use after free of SKB in tuntap code. Fix by Eric Dumazet,
reported by Dave Jones.
2) NFC LLCP code emits annoying kernel log message, triggerable by
the user. From Dave Jones.
3) Fix several endianness bugs noticed by sparse in the bridging
code, from Stephen Hemminger.
4) Ipv6 NDISC code doesn't take padding into account properly, fix
from YOSHIFUJI Hideaki.
5) Add missing docs to ethtool_flow_ext struct, from Yan Burman."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bridge: fix icmpv6 endian bug and other sparse warnings
net: ethool: Document struct ethtool_flow_ext
ndisc: Fix padding error in link-layer address option.
tuntap: dont use skb after netif_rx_ni(skb)
nfc: remove noisy message from llcp_sock_sendmsg
Linus Torvalds [Thu, 13 Dec 2012 21:11:15 +0000 (13:11 -0800)]
Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc VM changes from Andrew Morton:
"The rest of most-of-MM. The other MM bits await a slab merge.
This patch includes the addition of a huge zero_page. Not a
performance boost but it an save large amounts of physical memory in
some situations.
Also a bunch of Fujitsu engineers are working on memory hotplug.
Which, as it turns out, was badly broken. About half of their patches
are included here; the remainder are 3.8 material."
However, this merge disables CONFIG_MOVABLE_NODE, which was totally
broken. We don't add new features with "default y", nor do we add
Kconfig questions that are incomprehensible to most people without any
help text. Does the feature even make sense without compaction or
memory hotplug?
* akpm: (54 commits)
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
mm/memory.c: remove unused code from do_wp_page()
asm-generic, mm: pgtable: consolidate zero page helpers
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
hwpoison, hugetlbfs: fix RSS-counter warning
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
mm: protect against concurrent vma expansion
memcg: do not check for mm in __mem_cgroup_count_vm_event
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
mm: provide more accurate estimation of pages occupied by memmap
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
fs/buffer.c: do not inline exported function
writeback: fix a typo in comment
mm: introduce new field "managed_pages" to struct zone
mm, oom: remove statically defined arch functions of same name
mm, oom: remove redundant sleep in pagefault oom handler
mm, oom: cleanup pagefault oom handler
memory_hotplug: allow online/offline memory to result movable node
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
mm, memcg: avoid unnecessary function call when memcg is disabled
...
Linus Torvalds [Thu, 13 Dec 2012 20:14:47 +0000 (12:14 -0800)]
Merge tag 'for-3.8' of git://git./linux/kernel/git/helgaas/pci
Pull PCI update from Bjorn Helgaas:
"Host bridge hotplug:
- Untangle _PRT from struct pci_bus (Bjorn Helgaas)
- Request _OSC control before scanning root bus (Taku Izumi)
- Assign resources when adding host bridge (Yinghai Lu)
- Remove root bus when removing host bridge (Yinghai Lu)
- Remove _PRT during hot remove (Yinghai Lu)
SRIOV
- Add sysfs knobs to control numVFs (Don Dutile)
Power management
- Notify devices when power resource turned on (Huang Ying)
Bug fixes
- Work around broken _SEG on HP xw9300 (Bjorn Helgaas)
- Keep runtime PM enabled for unbound PCI devices (Huang Ying)
- Fix Optimus dual-GPU runtime D3 suspend issue (Dave Airlie)
- Fix xen frontend shutdown issue (David Vrabel)
- Work around PLX PCI 9050 BAR alignment erratum (Ian Abbott)
Miscellaneous
- Add GPL license for drivers/pci/ioapic (Andrew Cooks)
- Add standard PCI-X, PCIe ASPM register #defines (Bjorn Helgaas)
- NumaChip remote PCI support (Daniel Blueman)
- Fix PCIe Link Capabilities Supported Link Speed definition (Jingoo
Han)
- Convert dev_printk() to dev_info(), etc (Joe Perches)
- Add support for non PCI BAR ROM data (Matthew Garrett)
- Add x86 support for host bridge translation offset (Mike Yoknis)
- Report success only when every driver supports AER (Vijay
Pandarathil)"
Fix up trivial conflicts.
* tag 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (48 commits)
PCI: Use phys_addr_t for physical ROM address
x86/PCI: Add NumaChip remote PCI support
ath9k: Use standard #defines for PCIe Capability ASPM fields
iwlwifi: Use standard #defines for PCIe Capability ASPM fields
iwlwifi: collapse wrapper for pcie_capability_read_word()
iwlegacy: Use standard #defines for PCIe Capability ASPM fields
iwlegacy: collapse wrapper for pcie_capability_read_word()
cxgb3: Use standard #defines for PCIe Capability ASPM fields
PCI: Add standard PCIe Capability Link ASPM field names
PCI/portdrv: Use PCI Express Capability accessors
PCI: Use standard PCIe Capability Link register field names
x86: Use PCI setup data
PCI: Add support for non-BAR ROMs
PCI: Add pcibios_add_device
EFI: Stash ROMs if they're not in the PCI BAR
PCI: Add and use standard PCI-X Capability register names
PCI/PM: Keep runtime PM enabled for unbound PCI devices
xen-pcifront: Handle backend CLOSED without CLOSING
PCI: SRIOV control and status via sysfs (documentation)
PCI/AER: Report success only when every device has AER-aware driver
...
Linus Torvalds [Thu, 13 Dec 2012 20:04:35 +0000 (12:04 -0800)]
Merge tag 'regulator-3.8' of git://git./linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"A fairly quiet release again, a couple of relatively small new
features and a bunch of driver specific work including yet more code
elimination and fixes from Axel Lin.
- Addidion of linear_min_sel for offsetting linear selectors in the
helpers.
- Support for continuous voltage ranges for regulators with extremely
high resolution.
- Drivers for AS3711, DA9055, MAX9873, TPS51632, TPS80031 and ARM
vexpress."
Fix up trivial conflict (due to typo fix) in palmas-regulator.c
* tag 'regulator-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (80 commits)
regulator: core: Fix logic to determinate if regulator can change voltage
regulator: s5m8767: Fix to work even if no DVS gpio present
regulator: s5m8767: Fix to read the first DVS register.
regulator: s5m8767: Fix to work when platform registers less regulators
regulator: gpio-regulator: gpio_set_value should use cansleep
regulator: gpio-regulator: Fix logical error in for() loop
regulator: anatop: Use regulator_[get|set]_voltage_sel_regmap
regulator: anatop: Use linear_min_sel with linear mapping
regulator: max1586: Implement get_voltage_sel callback
regulator: lp8788-buck: Kill _gpio_request function
regulator: tps80031: Convert tps80031_ldo_ops to linear_min_sel and list_voltage_linear
regulator: lp8788-ldo: Remove val array in lp8788_config_ldo_enable_mode
regulator: gpio-regulator: Add ifdef CONFIG_OF guard for regulator_gpio_of_match
regulator: palmas: Convert palmas_ops_smps to regulator_[get|set]_voltage_sel_regmap
regulator: palmas: Return raw register values as the selectors in [get|set]_voltage_sel
regulators: add regulator_can_change_voltage() function
regulator: tps51632: Ensure [base|max]_voltage_uV pdata settings are valid
regulator: wm831x-dcdc: Add MODULE_ALIAS for wm831x-boostp
regulator: wm831x-dcdc: Ensure selected voltage falls within requested range
regulator: tps51632: Use linear_min_sel and regulator_[map|list]_voltage_linear
...
Linus Torvalds [Thu, 13 Dec 2012 20:00:48 +0000 (12:00 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID subsystem updates from Jiri Kosina:
1) Support for HID over I2C bus has been added by Benjamin Tissoires.
ACPI device discovery is still in the works.
2) Support for Win8 Multitiouch protocol is being added, most work done
by Benjamin Tissoires as well
3) EIO/ERESTARTSYS is fixed in hiddev/hidraw, fixes by Andrew Duggan
and Jiri Kosina
4) ION iCade driver added by Bastien Nocera
5) Support for a couple new Roccat devices has been added by Stefan
Achatz
6) HID sensor hubs are now auto-detected instead of having to list all
the VID/PID combinations in the blacklist array
7) other random fixes and support for new device IDs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (65 commits)
HID: i2c-hid: add mutex protecting open/close race
Revert "HID: sensors: add to special driver list"
HID: sensors: autodetect USB HID sensor hubs
HID: hidp: fallback to input session properly if hid is blacklisted
HID: i2c-hid: fix ret_count check
HID: i2c-hid: fix i2c_hid_get_raw_report count mismatches
HID: i2c-hid: remove extra .irq field in struct i2c_hid
HID: i2c-hid: reorder allocation/free of buffers
HID: i2c-hid: fix memory corruption due to missing hid declaration
HID: i2c-hid: remove superfluous include
HID: i2c-hid: remove unneeded test in i2c_hid_remove
HID: i2c-hid: i2c_hid_get_report may fail
HID: i2c-hid: also call i2c_hid_free_buffers in i2c_hid_remove
HID: i2c-hid: fix error messages
HID: i2c-hid: fix return paths
HID: i2c-hid: remove unused static declarations
HID: i2c-hid: fix i2c_hid_dbg macro
HID: i2c-hid: fix checkpatch.pl warning
HID: i2c-hid: enhance Kconfig
HID: i2c-hid: change I2C name
...
Linus Torvalds [Thu, 13 Dec 2012 20:00:02 +0000 (12:00 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/trivial
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Linus Torvalds [Thu, 13 Dec 2012 19:59:27 +0000 (11:59 -0800)]
Merge tag 'firewire-updates' of git://git./linux/kernel/git/ieee1394/linux1394
Pull IEEE 1394 (FireWire) subsystem updates from Stefan Richter:
- IPv4-over-1394: fixes for broadcast and multicast
- SBP-2: allow thin-provisioning related commands
- trivia
* tag 'firewire-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: net: remove unused variable in fwnet_receive_broadcast()
firewire: net: Fix handling of fragmented multicast/broadcast packets.
firewire: sbp2: allow WRITE SAME and REPORT SUPPORTED OPERATION CODES
tools/firewire: nosy-dump: check for allocation failure
Linus Torvalds [Thu, 13 Dec 2012 19:51:23 +0000 (11:51 -0800)]
Merge tag 'sound-3.8' of git://git./linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"This update contains a fairly wide range of changes all over in sound
subdirectory, mainly because of UAPI header moves by David and __dev*
annotation removals by Bill. Other highlights are:
- Introduced the support for wallclock timestamps in ALSA PCM core
- Add the poll loop implementation for HD-audio jack detection
- Yet more VGA-switcheroo fixes for HD-audio
- New VIA HD-audio codec support
- More fixes on resource management in USB audio and MIDI drivers
- More quirks for USB-audio ASUS Xonar U3, Reloop Play, Focusrite,
Roland VG-99, etc
- Add support for FastTrack C400 usb-audio
- Clean ups in many drivers regarding firmware loading
- Add PSC724 Ultiimate Edge support to ice1712
- A few hdspm driver updates
- New Stanton SCS.1d/1m FireWire driver
- Standardisation of the logging in ASoC codes
- DT and dmaengine support for ASoC Atmel
- Support for Wolfson ADSP cores
- New drivers for Freescale/iVeia P1022 and Maxim MAX98090
- Lots of other ASoC driver fixes and developments"
Fix up trivial conflicts. And go out on a limb and assume the dts file
'status' field of one of the conflicting things was supposed to be
"disabled", not "disable" like in pretty much all other cases.
* tag 'sound-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (341 commits)
ALSA: hda - Move runtime PM check to runtime_idle callback
ALSA: hda - Add stereo-dmic fixup for Acer Aspire One 522
ALSA: hda - Avoid doubly suspend after vga switcheroo
ALSA: usb-audio: Enable S/PDIF on the ASUS Xonar U3
ALSA: hda - Check validity of CORB/RIRB WP reads
ALSA: hda - use usleep_range in link reset and change timeout check
ALSA: HDA: VIA: Add support for codec VT1808.
ALSA: HDA: VIA Add support for codec VT1705CF.
ASoC: codecs: remove __dev* attributes
ASoC: utils: remove __dev* attributes
ASoC: ux500: remove __dev* attributes
ASoC: txx9: remove __dev* attributes
ASoC: tegra: remove __dev* attributes
ASoC: spear: remove __dev* attributes
ASoC: sh: remove __dev* attributes
ASoC: s6000: remove __dev* attributes
ASoC: OMAP: remove __dev* attributes
ASoC: nuc900: remove __dev* attributes
ASoC: mxs: remove __dev* attributes
ASoC: kirkwood: remove __dev* attributes
...
Linus Torvalds [Thu, 13 Dec 2012 19:00:00 +0000 (11:00 -0800)]
Merge tag 'boards2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC board updates, take 2 from Olof Johansson:
"This branch contains board updates for shmobile that had dependencies
on earlier branches past the first driver branch, and thus are merged
separately.
Most of these are to enable audio and USB on shmobile. They contain a
dependent ASoC branch that has been coordinated with Mark Brown."
* tag 'boards2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: shmobile: mackerel: Add FLCTL IRQ resource
ARM: shmobile: use FSI driver's audio clock on ap4evb
ARM: shmobile: use FSI driver's audio clock on mackerel
ARM: shmobile: use FSI driver's audio clock on armadillo800eva
ARM: shmobile: mackerel: enable DMAEngine on USB Host
ARM: shmobile: marzen: add USB OHCI driver support
ARM: shmobile: marzen: add USB EHCI driver support
ARM: shmobile: marzen: add USB phy support
ASoC: fsi: add master clock control functions
ASoC: fsi: care fsi_hw_start/stop() return value
ASoC: fsi: fsi_set_master_clk() was called from fsi_hw_xxx() only
ASoC: fsi: use devm_request_irq()
ASoC: fsi: fixup channels_min/max
Linus Torvalds [Thu, 13 Dec 2012 18:59:11 +0000 (10:59 -0800)]
Merge tag 'drivers' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC driver specific changes from Olof Johansson:
"A collection of mostly SoC-specific driver updates:
- a handful of pincontrol and setup changes
- new drivers for hwmon and reset controller for vexpress
- timing support updates for OMAP (gpmc and other interfaces)
- plus a collection of smaller cleanups"
* tag 'drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (21 commits)
ARM: ux500: fix pin warning
ARM: OMAP2+: tusb6010: generic timing calculation
ARM: OMAP2+: smc91x: generic timing calculation
ARM: OMAP2+: onenand: generic timing calculation
ARM: OMAP2+: gpmc: generic timing calculation
ARM: OMAP2+: gpmc: handle additional timings
ARM: OMAP2+: nand: remove redundant rounding
gpio: samsung: use pr_* instead of printk
ARM: ux500: fixup magnetometer pins
ARM: ux500: add STM pin configuration
ARM: ux500: 8500: add pinctrl support for uart1 and uart2
ARM: ux500: cosmetic fixups for uart0
gpio: samsung: Fix input mode setting function for GPIO int
ARM: SAMSUNG: Insert bitmap_gpio_int member in samsung_gpio_chip
ARM: ux500: 8500: define SDI sleep states
ARM: vexpress: Reset driver
ARM: ux500: 8500: update SKE keypad pinctrl table
hwmon: Versatile Express hwmon driver
ARM: ux500: delete duplicate macro
ARM: ux500: 8500: add IDLE pin configuration for SPI
...
Linus Torvalds [Thu, 13 Dec 2012 18:58:20 +0000 (10:58 -0800)]
Merge tag 'pm-merge' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC power management and clock changes from Olof Johansson:
"This branch contains a largeish set of updates of power management and
clock setup. The bulk of it is for OMAP/AM33xx platforms, but also a
few around hotplug/suspend/resume on Exynos.
It includes a split-up of some of the OMAP clock data into separate
files which adds to the diffstat, but gross delta is fairly reasonable."
* tag 'pm-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (60 commits)
ARM: OMAP: Move plat-omap/dma-omap.h to include/linux/omap-dma.h
ASoC: OMAP: mcbsp fixes for enabling ARM multiplatform support
watchdog: OMAP: fixup for ARM multiplatform support
ARM: EXYNOS: Add flush_cache_all in suspend finisher
ARM: EXYNOS: Remove scu_enable from cpuidle
ARM: EXYNOS: Fix soft reboot hang after suspend/resume
ARM: EXYNOS: Add support for rtc wakeup
ARM: EXYNOS: fix the hotplug for Cortex-A15
ARM: OMAP2+: omap_device: Correct resource handling for DT boot
ARM: OMAP2+: hwmod: Add possibility to count hwmod resources based on type
ARM: OMAP2+: hwmod: Add support for per hwmod/module context lost count
ARM: OMAP2+: PRM: initialize some PRM functions early
ARM: OMAP2+: voltage: fixup oscillator handling when CONFIG_PM=n
ARM: OMAP4: USB: power down MUSB PHY during boot
ARM: OMAP2+: clock: Cleanup !CONFIG_COMMON_CLK parts
ARM: OMAP2xxx: clock: drop obsolete clock data
ARM: OMAP2: clock: Cleanup !CONFIG_COMMON_CLK parts
ARM: OMAP3+: DPLL: drop !CONFIG_COMMON_CLK sections
ARM: AM33xx: clock: drop obsolete clock data
ARM: OMAP3xxx: clk: drop obsolete clock data
...
Linus Torvalds [Thu, 13 Dec 2012 18:57:16 +0000 (10:57 -0800)]
Merge tag 'multiplatform' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC multiplatform conversion patches from Olof Johansson:
"Here are more patches in the progression towards multiplatform, sparse
irq conversions in particular.
Tegra has a handful of cleanups and general groundwork, but is not
quite there yet on full enablement.
Platforms that are enabled through this branch are VT8500 and Zynq.
Note that i.MX was converted in one of the earlier cleanup branches as
well (before we started a separate topic for multiplatform). And both
new platforms for this merge window, sunxi and bcm, were merged with
multiplatform support enabled."
Fix up conflicts mostly as per Olof.
* tag 'multiplatform' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (29 commits)
ARM: zynq: Remove all unused mach headers
ARM: zynq: add support for ARCH_MULTIPLATFORM
ARM: zynq: make use of debug_ll_io_init()
ARM: zynq: remove TTC early mapping
ARM: tegra: move debug-macro.S to include/debug
ARM: tegra: don't include iomap.h from debug-macro.S
ARM: tegra: decouple uncompress.h and debug-macro.S
ARM: tegra: simplify DEBUG_LL UART selection options
ARM: tegra: select SPARSE_IRQ
ARM: tegra: enhance timer.c to get IO address from device tree
ARM: tegra: enhance timer.c to get IRQ info from device tree
ARM: timer: fix checkpatch warnings
ARM: tegra: add TWD to device tree
ARM: tegra: define DT bindings for and instantiate RTC
ARM: tegra: define DT bindings for and instantiate timer
clocksource/mtu-nomadik: use apb_pclk
clk: ux500: Register mtu apb_pclocks
ARM: plat-nomadik: convert platforms to SPARSE_IRQ
mfd/db8500-prcmu: use the irq_domain_add_simple()
mfd/ab8500-core: use irq_domain_add_simple()
...
Linus Torvalds [Thu, 13 Dec 2012 18:39:26 +0000 (10:39 -0800)]
Merge tag 'dt' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC device tree conversions and enablement from Olof Johansson:
"Continued device tree conversion and enablement across a number of
platforms; Kirkwood, tegra, i.MX, Exynos, zynq and a couple of other
smaller series as well.
ux500 has seen continued conversion for platforms. Several platforms
have seen pinctrl-via-devicetree conversions for simpler
multiplatform. Tegra is adding data for new devices/drivers, and
Exynos has a bunch of new bindings and devices added as well.
So, pretty much the same progression in the right direction as the
last few releases."
Fix up conflicts as per Olof.
* tag 'dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (185 commits)
ARM: ux500: Rename dbx500 cpufreq code to be more generic
ARM: dts: add missing ux500 device trees
ARM: ux500: Stop registering the PCM driver from platform code
ARM: ux500: Move board specific GPIO info out to subordinate DTS files
ARM: ux500: Disable the MMCI gpio-regulator by default
ARM: Kirkwood: remove kirkwood_ehci_init() from new boards
ARM: Kirkwood: Add support LED of OpenBlocks A6
ARM: Kirkwood: Convert to EHCI via DT for OpenBlocks A6
ARM: kirkwood: Add NAND partiton map for OpenBlocks A6
ARM: kirkwood: Add support second I2C bus and RTC on OpenBlocks A6
ARM: kirkwood: Add support DT of second I2C bus
ARM: kirkwood: Convert mplcec4 board to pinctrl
ARM: Kirkwood: Convert km_kirkwood to pinctrl
ARM: Kirkwood: support 98DX412x kirkwoods with pinctrl
ARM: Kirkwood: Convert IX2-200 to pinctrl.
ARM: Kirkwood: Convert lsxl boards to pinctrl.
ARM: Kirkwood: Convert ib62x0 to pinctrl.
ARM: Kirkwood: Convert GoFlex Net to pinctrl.
ARM: Kirkwood: Convert dreamplug to pinctrl.
ARM: Kirkwood: Convert dockstar to pinctrl.
...
stephen hemminger [Thu, 13 Dec 2012 06:51:28 +0000 (06:51 +0000)]
bridge: fix icmpv6 endian bug and other sparse warnings
Fix the warnings reported by sparse on recent bridge multicast
changes. Mostly just rcu annotation issues but in this case
sparse found a real bug! The ICMPv6 mld2 query mrc
values is in network byte order.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yan Burman [Thu, 13 Dec 2012 05:20:59 +0000 (05:20 +0000)]
net: ethool: Document struct ethtool_flow_ext
Add documentation for struct ethtool_flow_ext especially in regard
to what flags are needed for which fields.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YOSHIFUJI Hideaki / 吉藤英明 [Thu, 13 Dec 2012 04:29:36 +0000 (04:29 +0000)]
ndisc: Fix padding error in link-layer address option.
If a natural number n exists where 2 + data_len <= 8n < 2 + data_len + pad,
post padding is not initialized correctly.
(Un)fortunately, the only type that requires pad is Infiniband,
whose pad is 2 and data_len is 20, and this logical error has not
become obvious, but it is better to fix.
Note that ndisc_opt_addr_space() handles the situation described
above correctly.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 12 Dec 2012 19:22:57 +0000 (19:22 +0000)]
tuntap: dont use skb after netif_rx_ni(skb)
On Wed, 2012-12-12 at 23:16 -0500, Dave Jones wrote:
> Since todays net merge, I see this when I start openvpn..
>
> general protection fault: 0000 [#1] PREEMPT SMP
> Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs iTCO_wdt iTCO_vendor_support snd_emu10k1 snd_util_mem snd_ac97_codec coretemp ac97_bus microcode snd_hwdep snd_seq pcspkr snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 snd_rawmidi mfd_core snd_seq_device snd e1000e soundcore emu10k1_gp gameport i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate firewire_ohci sata_sil firewire_core crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> CPU 0
> Pid: 1381, comm: openvpn Not tainted 3.7.0+ #14 /D975XBX
> RIP: 0010:[<
ffffffff815b54a4>] [<
ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
> RSP: 0018:
ffff88007d0d9c48 EFLAGS:
00010206
> RAX:
000000000000055d RBX:
6b6b6b6b6b6b6b4b RCX:
1471030a0180040a
> RDX:
0000000000000005 RSI:
00000000ffffffe0 RDI:
ffff8800ba83fa80
> RBP:
ffff88007d0d9cb8 R08:
0000000000000000 R09:
0000000000000000
> R10:
0000000000000000 R11:
0000000000000101 R12:
ffff8800ba83fa80
> R13:
0000000000000008 R14:
ffff88007d0d9cc8 R15:
ffff8800ba83fa80
> FS:
00007f6637104800(0000) GS:
ffff8800bf600000(0000) knlGS:
0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
> CR2:
00007f563f5b01c4 CR3:
000000007d140000 CR4:
00000000000007f0
> DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
> DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
> Process openvpn (pid: 1381, threadinfo
ffff88007d0d8000, task
ffff8800a540cd60)
> Stack:
>
ffff8800ba83fa80 0000000000000296 0000000000000000 0000000000000000
>
ffff88007d0d9cc8 ffffffff815bcff4 ffff88007d0d9ce8 ffffffff815b1831
>
ffff88007d0d9ca8 00000000703f6364 ffff8800ba83fa80 0000000000000000
> Call Trace:
> [<
ffffffff815bcff4>] ? netif_rx+0x114/0x4c0
> [<
ffffffff815b1831>] ? skb_copy_datagram_from_iovec+0x61/0x290
> [<
ffffffff815b672a>] __skb_get_rxhash+0x1a/0xd0
> [<
ffffffffa03b9538>] tun_get_user+0x418/0x810 [tun]
> [<
ffffffff8135f468>] ? delay_tsc+0x98/0xf0
> [<
ffffffff8109605c>] ? __rcu_read_unlock+0x5c/0xa0
> [<
ffffffffa03b9a41>] tun_chr_aio_write+0x81/0xb0 [tun]
> [<
ffffffff81145011>] ? __buffer_unlock_commit+0x41/0x50
> [<
ffffffff811db917>] do_sync_write+0xa7/0xe0
> [<
ffffffff811dc01f>] vfs_write+0xaf/0x190
> [<
ffffffff811dc375>] sys_write+0x55/0xa0
> [<
ffffffff81705540>] tracesys+0xdd/0xe2
> Code: 41 8b 44 24 68 41 2b 44 24 6c 01 de 29 f0 83 f8 03 0f 8e a0 00 00 00 48 63 de 49 03 9c 24 e0 00 00 00 48 85 db 0f 84 72 fe ff ff <8b> 03 41 89 46 08 b8 01 00 00 00 e9 43 fd ff ff 0f 1f 40 00 48
> RIP [<
ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
> RSP <
ffff88007d0d9c48>
> ---[ end trace
6d42c834c72c002e ]---
>
>
> Faulting instruction is
>
> 0: 8b 03 mov (%rbx),%eax
>
> rbx is slab poison (-20) so this looks like a use-after-free here...
>
> flow->ports = *ports;
> 314: 8b 03 mov (%rbx),%eax
> 316: 41 89 46 08 mov %eax,0x8(%r14)
>
> in the inlined skb_header_pointer in skb_flow_dissect
>
> Dave
>
commit
96442e4242 (tuntap: choose the txq based on rxq) added
a use after free.
Cache rxhash in a temp variable before calling netif_rx_ni()
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Jones [Wed, 12 Dec 2012 18:11:34 +0000 (18:11 +0000)]
nfc: remove noisy message from llcp_sock_sendmsg
This is easily triggerable when fuzz-testing as an unprivileged user.
We could rate-limit it, but given we don't print similar messages
for other protocols, I just removed it.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 13 Dec 2012 02:07:07 +0000 (18:07 -0800)]
Merge git://git./linux/kernel/git/davem/net-next
Pull networking changes from David Miller:
1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.
2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.
3) Networking user namespace support from Eric W. Biederman.
4) tuntap/virtio-net multiqueue support by Jason Wang.
5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.
6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.
7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.
8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.
9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.
10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heurstics were choosen a decade ago.
From Eric Dumazet.
11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.
12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.
13) Add "oops_only" mode to netconsole, from Amerigo Wang.
14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.
15) Support PTP in the Tigon3 driver, from Matt Carlson.
16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.
17) Support per-association statistics in SCTP, from Michele
Baldessari.
And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...
Linus Torvalds [Thu, 13 Dec 2012 01:50:34 +0000 (17:50 -0800)]
Merge tag 'for-linus-
20121212' of git://git./linux/kernel/git/dhowells/linux-mn10300
Pull MN10300 changes from David Howells:
"miscellaneous MN10300 arch patches. I've based it on top of Al Viro's
signal tree - so these patches should be pulled after that."
* tag 'for-linus-
20121212' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-mn10300:
MN10300: Use asm-generic/pci_iomap.h
MN10300: Get rid of unused variable from ASB2305 PCI code
MN10300: ASB2305 PCI code needs linux/irq.h
mn10300/mm/fault.c: Port OOM changes to do_page_fault
MN10300: Handle cacheable PCI regions in pci_iomap()
MN10300: fix debug polling in ttySM driver
MN10300: ttySM: clean up unnecessary casting
MN10300: fix SMP synchronization between txdma and serial driver
MN10300: fix serial port vdma irq setup for SMP
MN10300: cleanup IRQ affinity setting
MN10300: ttySM: Use memory barriers correctly in circular buffer logic
Lin Feng [Wed, 12 Dec 2012 21:52:39 +0000 (13:52 -0800)]
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
reserve_bootmem_generic() has no caller,
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dominik Dingel [Wed, 12 Dec 2012 21:52:37 +0000 (13:52 -0800)]
mm/memory.c: remove unused code from do_wp_page()
page_mkwrite is initalized with zero and only set once, from that point
exists no way to get to the oom or oom_free_new labels.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:52:36 +0000 (13:52 -0800)]
asm-generic, mm: pgtable: consolidate zero page helpers
We have two different implementation of is_zero_pfn() and my_zero_pfn()
helpers: for architectures with and without zero page coloring.
Let's consolidate them in <asm-generic/pgtable.h>.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:33 +0000 (13:52 -0800)]
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
Fix the warning from __list_del_entry() which is triggered when a process
tries to do free_huge_page() for a hwpoisoned hugepage.
free_huge_page() can be called for hwpoisoned hugepage from
unpoison_memory(). This function gets refcount once and clears
PageHWPoison, and then puts refcount twice to return the hugepage back to
free pool. The second put_page() finally reaches free_huge_page().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:30 +0000 (13:52 -0800)]
hwpoison, hugetlbfs: fix RSS-counter warning
Memory error handling on hugepages can break a RSS counter, which emits a
message like "Bad rss-counter state mm:
ffff88040abecac0 idx:1 val:-1".
This is because PageAnon returns true for hugepage (this behavior is
necessary for reverse mapping to work on hugetlbfs).
[akpm@linux-foundation.org: clean up code layout]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Dec 2012 21:52:28 +0000 (13:52 -0800)]
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
When a process which used a hwpoisoned hugepage tries to exit() or
munmap(), the kernel can print out "bad pmd" message because page table
walker in free_pgtables() encounters 'hwpoisoned entry' on pmd.
This is because currently we fail to clear the hwpoisoned entry in
__unmap_hugepage_range(), so this patch simply does it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel Lespinasse [Wed, 12 Dec 2012 21:52:25 +0000 (13:52 -0800)]
mm: protect against concurrent vma expansion
expand_stack() runs with a shared mmap_sem lock. Because of this, there
could be multiple concurrent stack expansions in the same mm, which may
cause problems in the vma gap update code.
I propose to solve this by taking the mm->page_table_lock around such vma
expansions, in order to avoid the concurrency issue. We only have to
worry about concurrent expand_stack() calls here, since we hold a shared
mmap_sem lock and all vma modificaitons other than expand_stack() are done
under an exclusive mmap_sem lock.
I previously tried to achieve the same effect by making sure all growable
vmas in a given mm would share the same anon_vma, which we already lock
here. However this turned out to be difficult - all of the schemes I
tried for refcounting the growable anon_vma and clearing turned out ugly.
So, I'm now proposing only the minimal fix.
The overhead of taking the page table lock during stack expansion is
expected to be small: glibc doesn't use expandable stacks for the threads
it creates, so having multiple growable stacks is actually uncommon and we
don't expect the page table lock to get bounced between threads.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 12 Dec 2012 21:52:23 +0000 (13:52 -0800)]
memcg: do not check for mm in __mem_cgroup_count_vm_event
The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the
function is either called from the page fault path or vma->vm_mm is used.
So the check can be dropped.
The check was introduced by commit
456f998ec817 ("memcg: add the
pagefault count into memcg stats") because the originally proposed patch
used current->mm for shmem but this has been changed to vma->vm_mm later
on without the check being removed (thanks to Hugh for this
recollection).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 12 Dec 2012 21:52:21 +0000 (13:52 -0800)]
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
Revert 3.5's commit
f21f8062201f ("tmpfs: revert SEEK_DATA and
SEEK_HOLE") to reinstate
4fb5ef089b28 ("tmpfs: support SEEK_DATA and
SEEK_HOLE"), with the intervening additional arg to
generic_file_llseek_size().
In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
SEEK_DATA and SEEK_HOLE support; and a good case has now been made for
it on tmpfs, so let's join the party.
It's quite easy for tmpfs to scan the radix_tree to support llseek's new
SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are
still on my mind (in particular, the !PageUptodate-ness of pages
fallocated but still unwritten).
[akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Paul Eggert <eggert@cs.ucla.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiang Liu [Wed, 12 Dec 2012 21:52:19 +0000 (13:52 -0800)]
mm: provide more accurate estimation of pages occupied by memmap
If SPARSEMEM is enabled, it won't build page structures for non-existing
pages (holes) within a zone, so provide a more accurate estimation of
pages occupied by memmap if there are bigger holes within the zone.
And pages for highmem zones' memmap will be allocated from lowmem, so
charge nr_kernel_pages for that.
[akpm@linux-foundation.org: mark calc_memmap_size __paging_init]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Tested-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:16 +0000 (13:52 -0800)]
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
buffer_head comes from kmem_cache_zalloc(), no need to zero its fields.
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:15 +0000 (13:52 -0800)]
fs/buffer.c: do not inline exported function
It makes no sense to inline an exported function.
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Hong [Wed, 12 Dec 2012 21:52:14 +0000 (13:52 -0800)]
writeback: fix a typo in comment
Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiang Liu [Wed, 12 Dec 2012 21:52:12 +0000 (13:52 -0800)]
mm: introduce new field "managed_pages" to struct zone
Currently a zone's present_pages is calcuated as below, which is
inaccurate and may cause trouble to memory hotplug.
spanned_pages - absent_pages - memmap_pages - dma_reserve.
During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages may
have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.
For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/
This patchset tries to introduce a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system". And revert
zone->present_pages to count "physical pages existing in a zone", which
also keep in consistence with pgdat->node_present_pages.
We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.
For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when the
bootmem allocator frees all free pages to the buddy system in function
free_all_bootmem_node() and free_all_bootmem().
The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.
This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.
[akpm@linux-foundation.org: small comment tweaks]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:10 +0000 (13:52 -0800)]
mm, oom: remove statically defined arch functions of same name
out_of_memory() is a globally defined function to call the oom killer.
x86, sh, and powerpc all use a function of the same name within file scope
in their respective fault.c unnecessarily. Inline the functions into the
pagefault handlers to clean the code up.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:07 +0000 (13:52 -0800)]
mm, oom: remove redundant sleep in pagefault oom handler
out_of_memory() will already cause current to schedule if it has not been
killed, so doing it again in pagefault_out_of_memory() is redundant.
Remove it.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:52:06 +0000 (13:52 -0800)]
mm, oom: cleanup pagefault oom handler
To lock the entire system from parallel oom killing, it's possible to pass
in a zonelist with all zones rather than using for_each_populated_zone()
for the iteration. This obsoletes try_set_system_oom() and
clear_system_oom() so that they can be removed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:52:04 +0000 (13:52 -0800)]
memory_hotplug: allow online/offline memory to result movable node
Now, memory management can handle movable node or nodes which don't have
any normal memory, so we can dynamic configure and add movable node by:
online a ZONE_MOVABLE memory from a previous offline node
offline the last normal memory which result a non-normal-memory-node
movable-node is very important for power-saving, hardware partitioning and
high-available-system(hardware fault management).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:52:00 +0000 (13:52 -0800)]
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
We need a node which only contains movable memory. This feature is very
important for node hotplug. If a node has normal/highmem, the memory may
be used by the kernel and can't be offlined. If the node only contains
movable memory, we can offline the memory and the node.
All are prepared, we can actually introduce N_MEMORY.
add CONFIG_MOVABLE_NODE make we can use it for movable-dedicated node
[akpm@linux-foundation.org: fix Kconfig text]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 12 Dec 2012 21:51:57 +0000 (13:51 -0800)]
mm, memcg: avoid unnecessary function call when memcg is disabled
While profiling numa/core v16 with cgroup_disable=memory on the command
line, I noticed mem_cgroup_count_vm_event() still showed up as high as
0.60% in perftop.
This occurs because the function is called extremely often even when memcg
is disabled.
To fix this, inline the check for mem_cgroup_disabled() so we avoid the
unnecessary function call if memcg is disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Wed, 12 Dec 2012 21:51:56 +0000 (13:51 -0800)]
mm: add a reminder comment for __GFP_BITS_SHIFT
Cc: Glauber Costa <glommer@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Wed, 12 Dec 2012 21:51:53 +0000 (13:51 -0800)]
mm: WARN_ON_ONCE if f_op->mmap() change vma's start address
During reviewing the source code, I found a comment which mention that
after f_op->mmap(), vma's start address can be changed. I didn't verify
that it is really possible, because there are so many f_op->mmap()
implementation. But if there are some mmap() which change vma's start
address, it is possible error situation, because we already prepare prev
vma, rb_link and rb_parent and these are related to original address.
So add WARN_ON_ONCE for finding that this situtation really happens.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Wed, 12 Dec 2012 21:51:52 +0000 (13:51 -0800)]
res_counter: delete res_counter_write()
Since commit
628f42355389 ("memcg: limit change shrink usage") both
res_counter_write() and write_strategy_fn have been unused. This patch
deletes them both.
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:49 +0000 (13:51 -0800)]
hotplug: update nodemasks management
Update nodemasks management for N_MEMORY.
[lliubbo@gmail.com: fix build]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:46 +0000 (13:51 -0800)]
page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Since we introduced N_MEMORY, we update the initialization of node_states.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:43 +0000 (13:51 -0800)]
vmscan: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:40 +0000 (13:51 -0800)]
init: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:39 +0000 (13:51 -0800)]
kthread: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:37 +0000 (13:51 -0800)]
vmstat: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:36 +0000 (13:51 -0800)]
hugetlb: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:33 +0000 (13:51 -0800)]
mempolicy: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:30 +0000 (13:51 -0800)]
mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:28 +0000 (13:51 -0800)]
oom: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:27 +0000 (13:51 -0800)]
memcontrol: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:25 +0000 (13:51 -0800)]
procfs: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:24 +0000 (13:51 -0800)]
cpuset: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Wed, 12 Dec 2012 21:51:21 +0000 (13:51 -0800)]
mm: node_states: introduce N_MEMORY
We have N_NORMAL_MEMORY for standing for the nodes that have normal memory
with zone_type <= ZONE_NORMAL.
And we have N_HIGH_MEMORY for standing for the nodes that have normal or
high memory.
But we don't have any word to stand for the nodes that have *any* memory.
And we have N_CPU but without N_MEMORY.
Current code reuse the N_HIGH_MEMORY for this purpose because any node
which has memory must have high memory or normal memory currently.
A) But this reusing is bad for *readability*. Because the name
N_HIGH_MEMORY just stands for high or normal:
A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)
The user will be confused(why this function just counts for high or
normal memory node? does it counts for ZONE_MOVABLE's lru pages?)
until someone else tell them N_HIGH_MEMORY is reused to stand for
nodes that have any memory.
A.cont) If we introduce N_MEMORY, we can reduce this confusing
AND make the code more clearly:
A.example 2) mm/page_cgroup.c use N_HIGH_MEMORY twice:
One is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {
It means if the node have memory, we will allocate page_cgroup map for
the node. We should use N_MEMORY instead here to gaim more clearly.
The second using is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);
It means if the node has high or normal memory that can be allocated
from kernel. We should keep N_HIGH_MEMORY here, and it will be better
if the "any memory" semantic of N_HIGH_MEMORY is removed.
B) This reusing is out-dated if we introduce MOVABLE-dedicated node.
The MOVABLE-dedicated node should not appear in
node_stats[N_HIGH_MEMORY] nor node_stats[N_NORMAL_MEMORY],
because MOVABLE-dedicated node has no high or normal memory.
In x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, if a MOVABLE-dedicated node
is in node_stats[N_HIGH_MEMORY], it is also means it is in
node_stats[N_NORMAL_MEMORY], it causes SLUB wrong.
The slub uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and creates kmem_cache_node for MOVABLE-dedicated node and cause problem.
In one word, we need a N_MEMORY. We just intrude it as an alias to
N_HIGH_MEMORY and fix all im-proper usages of N_HIGH_MEMORY in late
patches.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marek Szyprowski [Wed, 12 Dec 2012 21:51:19 +0000 (13:51 -0800)]
mm: use migrate_prep() instead of migrate_prep_local()
__alloc_contig_migrate_range() should use all possible ways to get all the
pages migrated from the given memory range, so pruning per-cpu lru lists
for all CPUs is required, regadless the cost of such operation. Otherwise
some pages which got stuck at per-cpu lru list might get missed by
migration procedure causing the contiguous allocation to fail.
Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thierry Reding [Wed, 12 Dec 2012 21:51:17 +0000 (13:51 -0800)]
mm: compaction: Fix compiler warning
compact_capture_page() is only used if compaction is enabled so it should
be moved into the corresponding #ifdef.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:14 +0000 (13:51 -0800)]
thp: avoid race on multiple parallel page faults to the same page
pmd value is stable only with mm->page_table_lock taken. After taking
the lock we need to check that nobody modified the pmd before changing it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:12 +0000 (13:51 -0800)]
thp: introduce sysfs knob to disable huge zero page
By default kernel tries to use huge zero page on read page fault. It's
possible to disable huge zero page by writing 0 or enable it back by
writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:09 +0000 (13:51 -0800)]
thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events
hzp_alloc is incremented every time a huge zero page is successfully
allocated. It includes allocations which where dropped due
race with other allocation. Note, it doesn't count every map
of the huge zero page, only its allocation.
hzp_alloc_failed is incremented if kernel fails to allocate huge zero
page and falls back to using small pages.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:06 +0000 (13:51 -0800)]
thp: implement refcounting for huge zero page
H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.
We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.
If counter is 0, get_huge_zero_page() allocates a new huge page and takes
two references: one for caller and one for shrinker. We free the page
only in shrinker callback if counter is 1 (only shrinker has the
reference).
put_huge_zero_page() only decrements counter. Counter is never zero in
put_huge_zero_page() since shrinker holds on reference.
Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.
Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation. I think it's pretty reasonable for synthetic benchmark.
[lliubbo@gmail.com: fix mismerge]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:05 +0000 (13:51 -0800)]
thp: lazy huge zero page allocation
Instead of allocating huge zero page on hugepage_init() we can postpone it
until first huge zero page map. It saves memory if THP is not in use.
cmpxchg() is used to avoid race on huge_zero_pfn initialization.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Wed, 12 Dec 2012 21:51:02 +0000 (13:51 -0800)]
thp: setup huge zero page on non-write page fault
All code paths seems covered. Now we can map huge zero page on read page
fault.
We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup huge zero page (ENOMEM) we fallback to
handle_pte_fault() as we normally do in THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>