platform/adaptation/renesas_rcar/renesas_kernel.git
13 years ago gpio/misc: Add MODULE_ALIAS entries for CS5535 functions
Andres Salomon [Thu, 2 Dec 2010 03:55:10 +0000 (19:55 -0800)]
gpio/misc: Add MODULE_ALIAS entries for CS5535 functions

This adds MODULE_ALIAS entries to the various cs5535 subdevice modules; this
allows the modules to automatically be loaded when cs5535-mfd loads.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago misc: Convert cs5535-mfgpt from pci device to platform device
Andres Salomon [Sat, 23 Oct 2010 07:41:14 +0000 (00:41 -0700)]
misc: Convert cs5535-mfgpt from pci device to platform device

The cs5535-mfd driver now takes care of the PCI BAR handling; this
simplifies the mfgpt driver a bunch.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago gpio: Convert cs5535 from pci device to platform device
Andres Salomon [Sat, 23 Oct 2010 07:41:09 +0000 (00:41 -0700)]
gpio: Convert cs5535 from pci device to platform device

The cs5535-mfd driver now takes care of the PCI BAR handling; this
simplifies the gpio driver a lot.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Fix cs5535 warning on x86-64
Andres Salomon [Tue, 30 Nov 2010 21:54:39 +0000 (13:54 -0800)]
mfd: Fix cs5535 warning on x86-64

ARRAY_SIZE() returns size_t; use %zu instead of %d so that we don't
get warnings on x86-64.
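
A minimal userspace sketch of the format fix, assuming a stand-in ARRAY_SIZE() macro and a hypothetical example_cells table (neither is copied from the driver). sizeof yields size_t, so the matching conversion is %zu on both 32-bit and 64-bit builds:

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's ARRAY_SIZE(); the result has type size_t. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical cell table, used only for this illustration. */
static const char *const example_cells[] = { "gpio", "mfgpt", "acpi" };

/* Format the element count with %zu, the conversion that matches size_t. */
static int format_cell_count(char *buf, size_t len)
{
    return snprintf(buf, len, "%zu cells", ARRAY_SIZE(example_cells));
}
```

With %d the same call would pass a 64-bit size_t where an int is expected on x86-64, which is what the compiler warned about.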

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Implement runtime PM for WM8994 core driver
Mark Brown [Fri, 26 Nov 2010 17:19:35 +0000 (17:19 +0000)]
mfd: Implement runtime PM for WM8994 core driver

Allow the WM8994 to completely power off, including disabling the LDOs
if they are software controlled, when it goes idle. The CODEC subdevice
controls activity for the MFD as a whole.

If the GPIOs need to be used while the device is active, runtime PM
should be disabled for the device by machine-specific code.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Provide pm_runtime_no_callbacks flag in cell data
Mark Brown [Fri, 26 Nov 2010 17:19:34 +0000 (17:19 +0000)]
mfd: Provide pm_runtime_no_callbacks flag in cell data

Allow MFD cells to have pm_runtime_no_callbacks() called on them during
registration. This causes the runtime PM framework to ignore them,
allowing use of runtime PM to suspend the device as a whole even if
not all drivers for the MFD can usefully implement runtime PM. For
example, RTCs are likely to run continuously regardless of the power
state of the system.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Fix ab8500-debug indentation errors
Mattias Wallin [Fri, 26 Nov 2010 12:06:39 +0000 (13:06 +0100)]
mfd: Fix ab8500-debug indentation errors

Replace spaces with proper tabs.

Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Convert WM8994 to new irq_ interrupt methods
Mark Brown [Wed, 24 Nov 2010 18:01:44 +0000 (18:01 +0000)]
mfd: Convert WM8994 to new irq_ interrupt methods

Kernel 2.6.37 adds new interrupt methods which take a struct irq_data
rather than an irq number. Convert over to these as they will become
mandatory in future.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Convert WM835x to new irq_ interrupt methods
Mark Brown [Wed, 24 Nov 2010 18:01:43 +0000 (18:01 +0000)]
mfd: Convert WM835x to new irq_ interrupt methods

Kernel 2.6.37 adds new interrupt methods which take a struct irq_data
rather than an irq number. Convert over to these as they will become
mandatory in future.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Convert WM831x to new irq_ interrupt methods
Mark Brown [Wed, 24 Nov 2010 18:01:42 +0000 (18:01 +0000)]
mfd: Convert WM831x to new irq_ interrupt methods

Kernel 2.6.37 adds new interrupt methods which take a struct irq_data
rather than an irq number. Convert over to these as they will become
mandatory in future.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Add WM8326 support
Mark Brown [Wed, 24 Nov 2010 18:01:41 +0000 (18:01 +0000)]
mfd: Add WM8326 support

The WM8326 is a high performance variant of the WM832x series with
no software visible differences.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Simplify WM832x subdevice instantiation
Mark Brown [Wed, 24 Nov 2010 18:01:40 +0000 (18:01 +0000)]
mfd: Simplify WM832x subdevice instantiation

All the current WM832x devices have the same set of subdevices so can
just use multiple case statements with a single body.
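
The "multiple case statements with a single body" idiom can be sketched as follows. The chip identifiers and the subdevice count are hypothetical placeholders, not the values used by the driver:

```c
/* Hypothetical chip identifiers for illustration only. */
enum example_chip {
    EX_CHIP_A,
    EX_CHIP_B,
    EX_CHIP_C,
    EX_CHIP_UNKNOWN
};

/* Return the number of subdevices to register, or -1 for an unknown chip. */
static int example_subdev_count(enum example_chip id)
{
    switch (id) {
    case EX_CHIP_A:
    case EX_CHIP_B:
    case EX_CHIP_C:
        /* All listed variants share one subdevice set, so one body serves them all. */
        return 12; /* placeholder count */
    default:
        return -1;
    }
}
```

Adding another variant with the same subdevices then takes one extra case label rather than a duplicated body.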

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Use printf extension %pR for struct resource
Joe Perches [Fri, 12 Nov 2010 21:37:56 +0000 (13:37 -0800)]
mfd: Use printf extension %pR for struct resource

Using %pR standardizes the struct resource output.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Add cs5535-mfd driver for AMD Geode's CS5535/CS5536 support
Andres Salomon [Fri, 26 Nov 2010 10:52:35 +0000 (11:52 +0100)]
mfd: Add cs5535-mfd driver for AMD Geode's CS5535/CS5536 support

Add an MFD driver to handle the ISA device on CS5535 and CS5536
southbridges. This ISA bridge is actually multiple devices: GPIOs,
MFGPTs, etc.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Don't open-code mc13xxx_unlock
Uwe Kleine-König [Thu, 11 Nov 2010 15:47:50 +0000 (16:47 +0100)]
mfd: Don't open-code mc13xxx_unlock

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Include <linux/gpio.h> instead of <asm/gpio.h>
Axel Lin [Wed, 10 Nov 2010 07:49:41 +0000 (15:49 +0800)]
mfd: Include <linux/gpio.h> instead of <asm/gpio.h>

As warned by checkpatch.pl, use #include <linux/gpio.h> instead
of <asm/gpio.h>.

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Include <linux/io.h> instead of <asm/io.h>
Axel Lin [Wed, 10 Nov 2010 07:47:51 +0000 (15:47 +0800)]
mfd: Include <linux/io.h> instead of <asm/io.h>

As warned by checkpatch.pl, use #include <linux/io.h> instead of <asm/io.h>

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Acked-by: Ben Dooks <ben@simtec.co.uk>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago mfd: Update WARN uses
Joe Perches [Sat, 30 Oct 2010 21:08:32 +0000 (14:08 -0700)]
mfd: Update WARN uses

Remove KERN_<level>.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
13 years ago e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
Bruce Allan [Fri, 31 Dec 2010 06:10:01 +0000 (06:10 +0000)]
e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs

Some minor comment errors and whitespace issues discovered while looking
into this are also addressed.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
13 years ago e1000e: update Copyright for 2011
Bruce Allan [Tue, 4 Jan 2011 01:16:44 +0000 (01:16 +0000)]
e1000e: update Copyright for 2011

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
13 years ago e1000: Avoid unhandled IRQ
Jesse Brandeburg [Thu, 13 Jan 2011 07:48:13 +0000 (07:48 +0000)]
e1000: Avoid unhandled IRQ

If hardware asserted an interrupt and driver is down,
then there is nothing to do so return IRQ_HANDLED
instead of IRQ_NONE. Returning IRQ_NONE in above
situation causes screaming IRQ on virtual machines.
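
A userspace sketch of the return-value choice described above. The enum, the adapter struct, and the handler name are hypothetical stand-ins, not the driver's actual code:

```c
#include <stdbool.h>

/* Stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED return values. */
enum example_irqreturn { EX_IRQ_NONE = 0, EX_IRQ_HANDLED = 1 };

/* Hypothetical adapter state. */
struct example_adapter {
    bool down;              /* driver has been brought down */
    unsigned int serviced;  /* interrupts that did real work */
};

static enum example_irqreturn example_intr(struct example_adapter *a,
                                           unsigned int cause)
{
    /* No cause bits set: the interrupt did not come from this device. */
    if (cause == 0)
        return EX_IRQ_NONE;

    /*
     * The hardware asserted the interrupt but the driver is down: there is
     * nothing to do, yet the interrupt is ours, so report it as handled.
     */
    if (a->down)
        return EX_IRQ_HANDLED;

    a->serviced++;
    return EX_IRQ_HANDLED;
}
```

Reporting the interrupt as handled while the driver is down keeps it from being counted as unhandled.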

CC: Andy Gospodarek <gospo@redhat.com>
Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
13 years ago drm/i915: Disable GPU semaphores on SandyBridge mobile
Chris Wilson [Fri, 14 Jan 2011 09:46:38 +0000 (09:46 +0000)]
drm/i915: Disable GPU semaphores on SandyBridge mobile

Hopefully, this is a temporary measure whilst the root cause is
understood. At the moment, we experience a hard hang whilst looping
urbanterror that has been identified as a result of the use of
semaphores, but so far only on SNB mobile.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32752
Tested-by: mengmeng.meng@intel.com
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
13 years ago ARM mxs: clkdev related compile fixes
Sascha Hauer [Thu, 13 Jan 2011 15:59:25 +0000 (16:59 +0100)]
ARM mxs: clkdev related compile fixes

Since commit

6d803ba (ARM: 6483/1: arm & sh: factorised duplicated clkdev.c)

platforms need to select CLKDEV_LOOKUP instead of COMMON_CLKDEV and need
to include <linux/clkdev.h>.

Cc: Shawn Guo <shawn.guo@freescale.com>
Cc: Lothar Waßmann <LW@KARO-electronics.de>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
13 years ago ARM: 6624/1: fix dependency for CONFIG_SMP_ON_UP
Nicolas Pitre [Fri, 14 Jan 2011 06:33:24 +0000 (07:33 +0100)]
ARM: 6624/1: fix dependency for CONFIG_SMP_ON_UP

This depends on !XIP_KERNEL and not !XIP.

Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
13 years ago ARM: 6623/1: Thumb-2: Fix out-of-range offset for Thumb-2 in proc-v7.S
Dave Martin [Thu, 13 Jan 2011 23:43:01 +0000 (00:43 +0100)]
ARM: 6623/1: Thumb-2: Fix out-of-range offset for Thumb-2 in proc-v7.S

Commit d30e45e (ARM: pgtable: switch order of Linux vs hardware page tables)
introduced a pre-increment addressing offset which is out of range for
Thumb-2.  Thumb-2 only permits offsets <256.  So split the instruction in
two for Thumb-2.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
13 years ago ARM i.MX mx31_3ds: Fix MC13783 regulator names
Sascha Hauer [Fri, 14 Jan 2011 08:44:02 +0000 (09:44 +0100)]
ARM i.MX mx31_3ds: Fix MC13783 regulator names

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
13 years ago cgroups: Fix a lockdep warning at cgroup removal
Li Zefan [Fri, 14 Jan 2011 03:34:34 +0000 (11:34 +0800)]
cgroups: Fix a lockdep warning at cgroup removal

Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
lock, which caused a lockdep warning.

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
13 years ago fs: namei fix ->put_link on wrong inode in do_filp_open
Nick Piggin [Fri, 14 Jan 2011 08:42:43 +0000 (08:42 +0000)]
fs: namei fix ->put_link on wrong inode in do_filp_open

J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago spi: Enable SPI driver for S5P6440 and S5P6450
Abhilash Kesavan [Wed, 12 Jan 2011 06:00:23 +0000 (15:00 +0900)]
spi: Enable SPI driver for S5P6440 and S5P6450

This patch enables the existing S3C64XX series SPI driver for S5P64X0
and removes the dependency on EXPERIMENTAL because we don't need it now.

v3: Changed dependency of S3C64XX_DMA
v2: Removed dependency on EXPERIMENTAL

Signed-off-by: Abhilash Kesavan <a.kesavan@samsung.com>
Signed-off-by: Sangbeom Kim <sbkim73@samsung.com>
Acked-by: Jassi Brar <jassi.brar@samsung.com>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
13 years ago block cfq: compensate preempted queue even if it has no slice assigned
Shaohua Li [Fri, 14 Jan 2011 07:41:03 +0000 (08:41 +0100)]
block cfq: compensate preempted queue even if it has no slice assigned

If a queue is preempted before it gets a slice assigned, the queue doesn't get
compensation, which looks unfair. For such a queue, we compensate it for a whole
slice.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years ago block cfq: make queue preempt work for queues from different workload
Shaohua Li [Fri, 14 Jan 2011 07:41:02 +0000 (08:41 +0100)]
block cfq: make queue preempt work for queues from different workload

I got this:
             fio-874   [007]  2157.724514:   8,32   m   N cfq874 preempt
             fio-874   [007]  2157.724519:   8,32   m   N cfq830 slice expired t=1
             fio-874   [007]  2157.724520:   8,32   m   N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
             fio-874   [007]  2157.724521:   8,32   m   N cfq830 set_active wl_prio:0 wl_type:0
             fio-874   [007]  2157.724522:   8,32   m   N cfq830 Not idling. st->count:1

cfq830 is an async queue, and it is preempted by a sync queue, cfq874. But
because of the cfqg->saved_workload_slice mechanism, the preempt is a nop.
It looks like our preempt is currently totally broken if the two queues are
not from the same workload type.
The patch below fixes it. This might cause async queue starvation, but that
is what our old code did before cgroup was added.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
13 years ago mmc: sdhci-of: fix build on non-powerpc platforms
Rob Herring [Tue, 16 Nov 2010 20:33:52 +0000 (14:33 -0600)]
mmc: sdhci-of: fix build on non-powerpc platforms

Explicitly include err.h, of_address.h and of_irq.h.
Make use of machine_is() conditional on PPC.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
13 years ago r8169: keep firmware in memory.
françois romieu [Thu, 13 Jan 2011 13:07:53 +0000 (13:07 +0000)]
r8169: keep firmware in memory.

The firmware agent is not available during resume. Loading the firmware
during open() (see eee3a96c6368f47df8df5bd4ed1843600652b337) is not
enough.

close() is run during resume through rtl8169_reset_task(), whence the
mildly natural release of firmware in the driver removal method instead.

It will help with http://bugs.debian.org/609538. It will not avoid
the 60 seconds delay when:
- there is no firmware
- the driver is loaded and the device is not up before a suspend/resume

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Tested-by: Jarek Kamiński <jarek@vilo.eu.org>
Cc: Hayes <hayeswang@realtek.com>
Cc: Ben Hutchings <benh@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago netdev: tilepro: Use is_unicast_ether_addr helper
Tobias Klauser [Wed, 12 Jan 2011 22:15:08 +0000 (22:15 +0000)]
netdev: tilepro: Use is_unicast_ether_addr helper

Use is_unicast_ether_addr from linux/etherdevice.h instead of custom
macros.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago etherdevice.h: Add is_unicast_ether_addr function
Tobias Klauser [Wed, 12 Jan 2011 22:14:56 +0000 (22:14 +0000)]
etherdevice.h: Add is_unicast_ether_addr function

From a check for !is_multicast_ether_addr it is not always obvious that
we're checking for a unicast address. So add this helper function to
make those code paths easier to read.
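
A userspace sketch of the readability point, using hypothetical example_ names rather than the kernel helpers. In Ethernet addressing, the low bit of the first octet marks a group (multicast or broadcast) address, so a unicast check is the negation of the multicast check:

```c
#include <stdbool.h>
#include <stdint.h>

/* Group addresses (multicast, including broadcast) have bit 0 of octet 0 set. */
static bool example_is_multicast(const uint8_t addr[6])
{
    return (addr[0] & 0x01) != 0;
}

/* Named wrapper so call sites state their intent instead of negating. */
static bool example_is_unicast(const uint8_t addr[6])
{
    return !example_is_multicast(addr);
}
```

A call site reading example_is_unicast(addr) says what it checks, where !example_is_multicast(addr) leaves the reader to infer it.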

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago ks8695net: Use default implementation of ethtool_ops::get_link
Ben Hutchings [Thu, 13 Jan 2011 07:52:51 +0000 (07:52 +0000)]
ks8695net: Use default implementation of ethtool_ops::get_link

This is completely untested as I don't have an ARM build environment.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago ks8695net: Disable non-working ethtool operations
Ben Hutchings [Thu, 13 Jan 2011 07:50:14 +0000 (07:50 +0000)]
ks8695net: Disable non-working ethtool operations

Some ethtool operations can only be implemented for the WAN port, and
not all such operations are allowed to return an error code such as
-EOPNOTSUPP.  Therefore, define two separate ethtool_ops structures
for WAN and non-WAN ports; simplify and rename the WAN-only functions.

This is completely untested as I don't have an ARM build environment.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
Jesper Juhl [Thu, 13 Jan 2011 11:40:11 +0000 (11:40 +0000)]
USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.

skb_clone() dynamically allocates memory and may fail. If it does it
returns NULL. This means we'll dereference a NULL pointer in
drivers/net/usb/cdc_ncm.c::cdc_ncm_rx_fixup().
As far as I can tell, the proper way to deal with this is simply to goto
the error label.

Furthermore gcc complains that 'skb' may be used uninitialized:
  drivers/net/usb/cdc_ncm.c: In function ‘cdc_ncm_rx_fixup’:
  drivers/net/usb/cdc_ncm.c:922:18: warning: ‘skb’ may be used uninitialized in this function
and I believe it is right. On the line where we
  pr_debug("invalid frame detected (ignored)" ...
we are using the local variable 'skb' but nothing has ever been assigned
to that variable yet. I believe the correct fix for that is to use
'skb_in' instead.
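
The NULL check can be sketched in userspace as follows. example_clone() is a hypothetical stand-in for an allocating clone such as skb_clone(), with a flag to simulate allocation failure; none of these names come from the driver:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type standing in for a socket buffer. */
struct example_buf {
    size_t len;
    unsigned char data[64];
};

/* Allocating clone that may fail and return NULL. */
static struct example_buf *example_clone(const struct example_buf *in,
                                         int simulate_failure)
{
    struct example_buf *copy;

    if (simulate_failure)
        return NULL;
    copy = malloc(sizeof(*copy));
    if (copy)
        memcpy(copy, in, sizeof(*copy));
    return copy;
}

/* Returns the cloned length on success, -1 on error. */
static int example_rx_fixup(const struct example_buf *in, int simulate_failure)
{
    struct example_buf *copy;
    int len;

    copy = example_clone(in, simulate_failure);
    if (!copy)
        goto error;         /* never dereference a failed clone */

    len = (int)copy->len;   /* safe: copy is known to be non-NULL here */
    free(copy);
    return len;

error:
    return -1;
}
```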

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago vxge: Remember to release firmware after upgrading firmware
Jesper Juhl [Thu, 13 Jan 2011 10:25:20 +0000 (10:25 +0000)]
vxge: Remember to release firmware after upgrading firmware

Regardless of whether the firmware update being performed by
vxge_fw_upgrade() is a success or not we must still remember to always
release_firmware() before returning.
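
A userspace sketch of the single-exit cleanup pattern this implies. The example_ functions and the release counter are hypothetical; malloc()/free() stand in for acquiring and releasing the firmware image:

```c
#include <stdlib.h>

/* Counts releases so the behavior can be observed; illustration only. */
static int example_release_count;

static void *example_request_firmware(void)
{
    return malloc(16);
}

static void example_release_firmware(void *fw)
{
    free(fw);
    example_release_count++;
}

/* Returns 0 on success, -1 on failure; the image is released either way. */
static int example_fw_upgrade(int flash_ok)
{
    void *fw = example_request_firmware();
    int ret;

    if (!fw)
        return -1;          /* nothing was acquired, nothing to release */

    if (!flash_ok) {
        ret = -1;
        goto out;           /* failure path still goes through the release */
    }
    ret = 0;

out:
    example_release_firmware(fw);
    return ret;
}
```

Routing every return after a successful request through the out label makes it hard to add a new error path that forgets the release.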

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Ram Vepa <ram.vepa@exar.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
Joe Perches [Wed, 12 Jan 2011 18:08:04 +0000 (18:08 +0000)]
netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr

Remove code that has no effect.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago ipsec: update MAX_AH_AUTH_LEN to support sha512
Nicolas Dichtel [Thu, 13 Jan 2011 11:51:03 +0000 (11:51 +0000)]
ipsec: update MAX_AH_AUTH_LEN to support sha512

icv_truncbits is set to 256 for sha512, so update
MAX_AH_AUTH_LEN to 64.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago net: remove dev_txq_stats_fold()
Eric Dumazet [Wed, 12 Jan 2011 12:13:14 +0000 (12:13 +0000)]
net: remove dev_txq_stats_fold()

After recent changes (percpu stats on vlan/tunnels...), we no longer need
per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.html

Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13 years ago Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux...
Linus Torvalds [Fri, 14 Jan 2011 04:15:35 +0000 (20:15 -0800)]
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
  ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
  ACPI: fix resource check message
  ACPI / Battery: Update information on info notification and resume
  ACPI: Drop device flag wake_capable
  ACPI: Always check if _PRW is present before trying to evaluate it
  ACPI / PM: Check status of power resources under mutexes
  ACPI / PM: Rename acpi_power_off_device()
  ACPI / PM: Drop acpi_power_nocheck
  ACPI / PM: Drop acpi_bus_get_power()
  Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
  ACPI / Fan: Rework the handling of power resources
  ACPI / PM: Register power resource devices as soon as they are needed
  ACPI / PM: Register acpi_power_driver early
  ACPI / PM: Add function for updating device power state consistently
  ACPI / PM: Add function for device power state initialization
  ACPI / PM: Introduce __acpi_bus_get_power()
  ACPI / PM: Introduce function for refcounting device power resources
  ACPI / PM: Add functions for manipulating lists of power resources
  ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
  ACPICA: Update version to 20101209
  ...

13 years ago Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb...
Linus Torvalds [Fri, 14 Jan 2011 04:15:18 +0000 (20:15 -0800)]
Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6

* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer
  intel_idle: open broadcast clock event
  cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific
  cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle
  cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions
  SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW
  cpuidle: delete NOP CPUIDLE_FLAG_POLL
  ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs
  cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL
  ACPI, intel_idle: Cleanup idle= internal variables
  cpuidle: Make cpuidle_enable_device() call poll_idle_init()
  intel_idle: update Sandy Bridge core C-state residency targets

13 years ago Merge branch 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb...
Linus Torvalds [Fri, 14 Jan 2011 04:15:02 +0000 (20:15 -0800)]
Merge branch 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6

* 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6:
  SFI: use ioremap_cache() instead of ioremap()

13 years ago Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 14 Jan 2011 04:14:13 +0000 (20:14 -0800)]
Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin

* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
  fs: fix do_last error case when need_reval_dot
  nfs: add missing rcu-walk check
  fs: hlist UP debug fixup
  fs: fix dropping of rcu-walk from force_reval_path
  fs: force_reval_path drop rcu-walk before d_invalidate
  fs: small rcu-walk documentation fixes

Fixed up trivial conflicts in Documentation/filesystems/porting

13 years ago fs: fix do_last error case when need_reval_dot
J. R. Okajima [Fri, 14 Jan 2011 03:56:04 +0000 (03:56 +0000)]
fs: fix do_last error case when need_reval_dot

When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized to EISDIR, but it
is changed by d_revalidate(), which returns any positive value to represent
'the target dir is valid.'

We should keep and return the initialized 'error' in this case.
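
One user-visible instance of this error can be sketched from userspace, assuming Linux semantics: opening an existing directory for writing fails with EISDIR. The helper name is hypothetical, and this exercises only the system call, not the do_last() code path itself:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open an existing directory for writing; return the errno, or 0 on success. */
static int example_open_dir_for_write(const char *dir)
{
    int fd = open(dir, O_WRONLY);

    if (fd >= 0) {
        close(fd);          /* not expected for a directory */
        return 0;
    }
    return errno;
}
```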

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago nfs: add missing rcu-walk check
Nick Piggin [Fri, 14 Jan 2011 02:48:39 +0000 (02:48 +0000)]
nfs: add missing rcu-walk check

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago Merge branch 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Linus Torvalds [Fri, 14 Jan 2011 02:46:48 +0000 (18:46 -0800)]
Merge branch 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

* 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/p2m: Fix module linking error.
  xen p2m: clear the old pte when adding a page to m2p_override
  xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
  xen: introduce gnttab_map_refs and gnttab_unmap_refs
  xen p2m: transparently change the p2m mappings in the m2p override
  xen/gntdev: Fix circular locking dependency
  xen/gntdev: stop using "token" argument
  xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
  xen: add m2p override mechanism
  xen: move p2m handling to separate file
  xen/gntdev: add VM_PFNMAP to vma
  xen/gntdev: allow usermode to map granted pages
  xen: define gnttab_set_map_op/unmap_op

Fix up trivial conflict in drivers/xen/Kconfig

13 years ago Merge branch 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 14 Jan 2011 02:44:52 +0000 (18:44 -0800)]
Merge branch 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

* 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
  xen: rename platform-pci module to xen-platform-pci.
  xen-platform: use PCI interfaces to request IO and MEM resources.

13 years ago fs: hlist UP debug fixup
Nick Piggin [Fri, 14 Jan 2011 02:36:43 +0000 (02:36 +0000)]
fs: hlist UP debug fixup

Po-Yu Chuang <ratbert.chuang@gmail.com> noticed that hlist_bl_set_first could
crash on a UP system when LIST_BL_LOCKMASK is 0, because

LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));

always evaluates to true.

Fix the expression, and also avoid a dependency between bit spinlock
implementation and list bl code (list code shouldn't know anything
except that bit 0 is set when adding and removing elements). Eventually
if a good use case comes up, we might use this list to store 1 or more
arbitrary bits of data, so it really shouldn't be tied to locking either,
but for now they are helpful for debugging.
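
The failure of the original assertion can be reproduced in userspace. The sketch below evaluates the same expression with the mask passed as a parameter; the function name is hypothetical, and a mask of 1 (bit 0) is assumed for the configuration where the lock bit exists:

```c
/*
 * The condition the original assertion tested; a nonzero result means the
 * BUG check fires. "first" models the pointer value with its low flag bit.
 */
static int example_bug_condition(unsigned long first, unsigned long lockmask)
{
    return !(first & lockmask);
}
```

With lockmask equal to 0, first & lockmask is 0 for every value of first, so the negation is always 1 and the check fires unconditionally. With lockmask equal to 1, it fires only when bit 0 is clear.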

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago fs: fix dropping of rcu-walk from force_reval_path
Nick Piggin [Fri, 14 Jan 2011 02:36:19 +0000 (02:36 +0000)]
fs: fix dropping of rcu-walk from force_reval_path

As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago fs: force_reval_path drop rcu-walk before d_invalidate
Nick Piggin [Fri, 14 Jan 2011 02:35:53 +0000 (02:35 +0000)]
fs: force_reval_path drop rcu-walk before d_invalidate

d_revalidate can return in rcu-walk mode even when it returns 0.  We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even though d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.

(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago fs: small rcu-walk documentation fixes
Nick Piggin [Fri, 14 Jan 2011 02:26:53 +0000 (02:26 +0000)]
fs: small rcu-walk documentation fixes

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
13 years ago nfsd: break lease on unlink, link, and rename
J. Bruce Fields [Tue, 11 Jan 2011 18:55:46 +0000 (13:55 -0500)]
nfsd: break lease on unlink, link, and rename

Any change to any of the links pointing to an entry should also break
delegations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
13 years ago nfsd4: break lease on nfsd setattr
J. Bruce Fields [Tue, 11 Jan 2011 17:54:39 +0000 (12:54 -0500)]
nfsd4: break lease on nfsd setattr

Leases (delegations) should really be broken on any metadata change, not
just on size change.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
13 years ago nfsd: don't support msnfs export option
J. Bruce Fields [Tue, 11 Jan 2011 19:07:12 +0000 (14:07 -0500)]
nfsd: don't support msnfs export option

We've long had these pointless #ifdef MSNFS's sprinkled throughout the
code--pointless because MSNFS is always defined (and we give no config
option to make that easy to change).  So we could just remove the
ifdef's and compile the resulting code unconditionally.

But as long as we're there: why not just rip out this code entirely?
The only purpose is to implement the "msnfs" export option which turns
on Windows-like behavior in some cases, and:

- the export option isn't documented anywhere;
- the userland utilities (which would need to be able to parse
  "msnfs" in an export file) don't support it;
- I don't know how to maintain this, as I don't know what the
  proper behavior is; and
- google shows no evidence that anyone has ever used this.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
13 years ago nfsd4: initialize cb_per_client
J. Bruce Fields [Thu, 13 Jan 2011 22:08:19 +0000 (17:08 -0500)]
nfsd4: initialize cb_per_client

Otherwise a callback that is aborted before it runs will result in a
list_del on an uninitialized list head.
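
Why the initialization matters can be sketched with a minimal circular doubly linked list in userspace. The example_ names are hypothetical stand-ins modeled on the kernel's list helpers, not the kernel code itself:

```c
struct example_list_head {
    struct example_list_head *next, *prev;
};

/* An initialized head points at itself in both directions. */
static void example_init_list_head(struct example_list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry right after head. */
static void example_list_add(struct example_list_head *entry,
                             struct example_list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink entry. This dereferences both neighbours, so it is only safe when
 * next and prev hold valid pointers, which initialization guarantees.
 */
static void example_list_del(struct example_list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
}

static int example_list_empty(const struct example_list_head *h)
{
    return h->next == h;
}
```

Deleting an initialized entry that was never queued only rewrites its own self-pointers. Deleting an uninitialized one would write through garbage pointers.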

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
13 years ago memcg: fix memory migration of shmem swapcache
Daisuke Nishimura [Thu, 13 Jan 2011 23:47:43 +0000 (15:47 -0800)]
memcg: fix memory migration of shmem swapcache

In the current implementation mem_cgroup_end_migration() decides whether
the page migration has succeeded or not by checking "oldpage->mapping".

But if we are trying to migrate a shmem swapcache, the page->mapping of it
is NULL from the beginning, so the check would be invalid.  As a result,
mem_cgroup_end_migration() assumes the migration has succeeded even if
it's not, so "newpage" would be freed while it's not uncharged.

This patch fixes it by passing mem_cgroup_end_migration() the result of
the page migration.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset
Jesper Juhl [Thu, 13 Jan 2011 23:47:42 +0000 (15:47 -0800)]
memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset

In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory.  This can be more efficiently
achieved by using kzalloc() and vzalloc().  There's also one situation
where we can use kzalloc_node() - this is what's new in this version of
the patch.
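
As a rough userspace analogy only (this is not the kernel code), the same
cleanup looks like replacing malloc() + memset() with calloc().  The struct
and function names below are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the structure being allocated. */
struct demo_group {
	long usage;
	int id;
};

/* Before: allocate, then zero the memory in a separate step. */
struct demo_group *demo_alloc_old(void)
{
	struct demo_group *g = malloc(sizeof(*g));

	if (g)
		memset(g, 0, sizeof(*g));
	return g;
}

/* After: a single call that returns already-zeroed memory. */
struct demo_group *demo_alloc_new(void)
{
	return calloc(1, sizeof(struct demo_group));
}
```

Both helpers return zeroed memory (or NULL on failure); the second simply
leaves the zeroing to the allocator.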

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: fix deadlock between cpuset and memcg
Daisuke Nishimura [Thu, 13 Jan 2011 23:47:41 +0000 (15:47 -0800)]
memcg: fix deadlock between cpuset and memcg

Commit b1dd693e ("memcg: avoid deadlock between move charge and
try_charge()") can cause another deadlock on mmap_sem during task migration
if cpuset and memcg are mounted onto the same mount point.

After the commit, cgroup_attach_task() has a sequence like:

cgroup_attach_task()
  ss->can_attach()
    cpuset_can_attach()
    mem_cgroup_can_attach()
      down_read(&mmap_sem)        (1)
  ss->attach()
    cpuset_attach()
      mpol_rebind_mm()
        down_write(&mmap_sem)     (2)
        up_write(&mmap_sem)
      cpuset_migrate_mm()
        do_migrate_pages()
          down_read(&mmap_sem)
          up_read(&mmap_sem)
    mem_cgroup_move_task()
      mem_cgroup_clear_mc()
        up_read(&mmap_sem)

We can deadlock at (2) because we've already acquired the mmap_sem at (1).

But the commit itself is necessary to fix deadlocks which have existed
before the commit like:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot acquire the lock    |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot acquire the lock      |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

This patch fixes all of these problems by:
1. revert the commit.
2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge()
   has released the mmap_sem.
3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in
   mem_cgroup_move_charge() and, if it has failed to acquire the lock, cancel
   all extra charges, wake up all waiters, and retry trylock.
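
The trylock-and-retry idea in step 3 can be sketched in userspace as below.
This is only an illustration under stated assumptions: a C11 atomic flag
stands in for mmap_sem, the back-off hook stands in for "cancel all extra
charges and wake up all waiters", and every demo_* name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type; an atomic flag stands in for mmap_sem. */
struct demo_lock {
	atomic_flag held;
};

bool demo_trylock(struct demo_lock *l)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set(&l->held);
}

void demo_unlock(struct demo_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Example back-off hook.  In this toy it simply releases the lock passed in
 * arg, modelling "the woken waiter finishes and drops the lock".
 */
void demo_release_holder(void *arg)
{
	demo_unlock(arg);
}

/*
 * Never block on the lock: on failure run the back-off hook and try again.
 * Returns how many times we had to back off.
 */
int demo_lock_with_backoff(struct demo_lock *l, void (*backoff)(void *),
			   void *arg)
{
	int retries = 0;

	while (!demo_trylock(l)) {
		backoff(arg);
		retries++;
	}
	return retries;
}
```

The point of the pattern is that the task never sleeps on the lock while
other tasks are waiting for it to make progress.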

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list()
Minchan Kim [Thu, 13 Jan 2011 23:47:40 +0000 (15:47 -0800)]
memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list()

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: fix unit mismatch in memcg oom limit calculation
Johannes Weiner [Thu, 13 Jan 2011 23:47:39 +0000 (15:47 -0800)]
memcg: fix unit mismatch in memcg oom limit calculation

Adding the number of swap pages to the byte limit of a memory control
group makes no sense.  Convert the pages to bytes before adding them.
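
A minimal sketch of the unit fix, assuming 4 KiB pages; the function names
and DEMO_PAGE_SHIFT are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Buggy form: adds a page count to a byte count. */
uint64_t demo_oom_limit_buggy(uint64_t limit_bytes, uint64_t swap_pages)
{
	return limit_bytes + swap_pages;
}

/* Fixed form: convert the pages to bytes before adding them. */
uint64_t demo_oom_limit_fixed(uint64_t limit_bytes, uint64_t swap_pages)
{
	return limit_bytes + (swap_pages << DEMO_PAGE_SHIFT);
}
```

With a 1 MiB limit and 256 swap pages, the buggy form yields 1048832 while
the fixed form yields 2097152 (1 MiB + 1 MiB).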

The only user of this code is the OOM killer, and the way it is used means
that the error results in a higher OOM badness value.  Since the cgroup
limit is the same for all tasks in the cgroup, the error should have no
practical impact at the moment.

But let's not wait for future or changing users to trip over it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: add lock to synchronize page accounting and migration
KAMEZAWA Hiroyuki [Thu, 13 Jan 2011 23:47:38 +0000 (15:47 -0800)]
memcg: add lock to synchronize page accounting and migration

Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
accounting and migration code.  This reworks the locking scheme of
_update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
which is always taken under IRQ disable.

1. If pages are being migrated from a memcg, then updates to that
   memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
   move_lock_page_cgroup().  In an upcoming commit, memcg dirty page
   accounting will be updating memcg page accounting (specifically: num
   writeback pages) from IRQ context (softirq).  Avoid a deadlocking
   nested spin lock attempt by disabling irq on the local processor when
   grabbing the PCG_MOVE_LOCK.

2. lock for update_page_stat is used only for avoiding race with
   move_account().  So, IRQ awareness of lock_page_cgroup() itself is not
   a problem.  The problem is between mem_cgroup_update_page_stat() and
   mem_cgroup_move_account_page().

Trade-off:
  * Changing lock_page_cgroup() to always disable IRQ (or
    local_bh) has some impacts on performance and I think
    it's bad to disable IRQ when it's not necessary.
  * adding a new lock makes move_account() slower.  Score is
    here.

Performance Impact: moving an 8G anon process.

Before:
real    0m0.792s
user    0m0.000s
sys     0m0.780s

After:
real    0m0.854s
user    0m0.000s
sys     0m0.842s

This score is bad but planned patches for optimization can reduce
this impact.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrea Righi <arighi@develer.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: create extensible page stat update routines
Greg Thelen [Thu, 13 Jan 2011 23:47:37 +0000 (15:47 -0800)]
memcg: create extensible page stat update routines

Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed.  However, these more
general interfaces allow for new statistics to be more easily added.  New
statistics are added with memcg dirty page accounting.
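
The shape of such an interface might look like the sketch below.  It is a
simplified illustration, not the kernel implementation: the demo_* names and
the plain counter array are assumptions, and no locking is shown.

```c
#include <assert.h>

/* Hypothetical names; only file_mapped exists so far, as described above. */
enum demo_page_stat_item {
	DEMO_STAT_FILE_MAPPED,
	DEMO_STAT_NR,		/* number of items; new statistics go above */
};

struct demo_memcg {
	long stat[DEMO_STAT_NR];
};

/* One generic update routine indexed by statistic ... */
void demo_update_page_stat(struct demo_memcg *memcg,
			   enum demo_page_stat_item idx, int val)
{
	assert(idx < DEMO_STAT_NR);
	memcg->stat[idx] += val;
}

/* ... and two thin wrappers instead of one function per statistic. */
void demo_inc_page_stat(struct demo_memcg *memcg, enum demo_page_stat_item idx)
{
	demo_update_page_stat(memcg, idx, 1);
}

void demo_dec_page_stat(struct demo_memcg *memcg, enum demo_page_stat_item idx)
{
	demo_update_page_stat(memcg, idx, -1);
}
```

Adding a new statistic then only requires a new enum entry rather than a new
pair of functions.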

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: document cgroup dirty memory interfaces
Greg Thelen [Thu, 13 Jan 2011 23:47:36 +0000 (15:47 -0800)]
memcg: document cgroup dirty memory interfaces

Document cgroup dirty memory interfaces and statistics.

[akpm@linux-foundation.org: fix use_hierarchy description]
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago memcg: add page_cgroup flags for dirty page tracking
Greg Thelen [Thu, 13 Jan 2011 23:47:35 +0000 (15:47 -0800)]
memcg: add page_cgroup flags for dirty page tracking

This patchset provides the ability for each cgroup to have independent
dirty page limits.

Limiting dirty memory is like fixing the max amount of dirty (hard to
reclaim) page cache used by a cgroup.  So, in case of multiple cgroup
writers, they will not be able to consume more than their designated share
of dirty pages and will be forced to perform write-out if they cross that
limit.

The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:

- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
  unstable.

- Extend mem_cgroup to record the total number of pages in each of the
  interesting dirty states (dirty, writeback, unstable_nfs).

- Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
  limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
  via cgroupfs control files.

- Consider both system and per-memcg dirty limits in page writeback when
  deciding to queue background writeback or block for foreground writeback.

Known shortcomings:

- When a cgroup dirty limit is exceeded, then bdi writeback is employed to
  writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
  just inodes contributing dirty pages to the cgroup exceeding its limit.

- When memory.use_hierarchy is set, then dirty limits are disabled.  This is an
  implementation detail.  An enhanced implementation is needed to check the
  chain of parents to ensure that no dirty limit is exceeded.

Performance data:
- A page fault microbenchmark workload was used to measure performance, which
  can be called in read or write mode:
        f = open(foo. $cpu)
        truncate(f, 4096)
        alarm(60)
        while (1) {
                p = mmap(f, 4096)
                if (write)
                        *p = 1
                else
                        x = *p
                munmap(p)
        }

- The workload was called for several points in the patch series in different
  modes:
  - s_read is a single threaded reader
  - s_write is a single threaded writer
  - p_read is a 16 thread reader, each operating on a different file
  - p_write is a 16 thread writer, each operating on a different file

- Measurements were collected on a 16 core non-numa system using "perf stat
  --repeat 3".  The -a option was used for parallel (p_*) runs.

- All numbers are page fault rate (M/sec).  Higher is better.

- To compare the performance of a kernel without memcg, compare the first and
  last rows; neither has memcg configured.  The first row does not include any
  of these memcg patches.

- To compare the performance of using memcg dirty limits, compare the baseline
  (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
  row titled "all patches").

                           root_cgroup                    child_cgroup
                 s_read s_write p_read p_write   s_read s_write p_read p_write
mmotm w/o memcg   0.428  0.390   0.429  0.388
mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
all patches       0.431  0.402   0.427  0.395
  w/o memcg

This patch:

Add additional flags to page_cgroup to track dirty pages within a
mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm: batch activate_page() to reduce lock contention
Shaohua Li [Thu, 13 Jan 2011 23:47:34 +0000 (15:47 -0800)]
mm: batch activate_page() to reduce lock contention

The zone->lru_lock is heavily contended in workloads where activate_page()
is frequently used.  We could do batch activate_page() to reduce the lock
contention.  The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.
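
The batching idea can be sketched as follows.  This is a toy model under
stated assumptions: integers stand in for pages, a counter stands in for
taking zone->lru_lock, and the pool size and demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH_SIZE 14	/* assumption: arbitrary pool size */

struct demo_zone {
	unsigned long lock_acquisitions;	/* times the "lock" was taken */
	unsigned long activated;		/* pages moved to the list */
};

struct demo_batch {
	int pages[DEMO_BATCH_SIZE];	/* page ids stand in for pages */
	size_t nr;
};

/* Move every queued page under a single lock acquisition. */
void demo_batch_drain(struct demo_zone *z, struct demo_batch *b)
{
	if (b->nr == 0)
		return;
	z->lock_acquisitions++;		/* lock taken once per batch */
	z->activated += b->nr;
	b->nr = 0;
}

/* Queue a page; drain only when the pool is full. */
void demo_activate_page(struct demo_zone *z, struct demo_batch *b, int page)
{
	assert(b->nr < DEMO_BATCH_SIZE);
	b->pages[b->nr++] = page;
	if (b->nr == DEMO_BATCH_SIZE)
		demo_batch_drain(z, b);
}
```

Activating 30 pages this way takes the lock twice (two full batches of 14)
and leaves 2 pages queued until an explicit drain, instead of taking the lock
30 times.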

For example, on a 4-socket, 64-CPU system, create a sparse file and 64
processes that each map the file shared.  Each process reads the whole
file and then exits.  Process exit calls unmap_vmas(), which causes a lot
of activate_page() calls.  In such a workload, we saw about a 58% reduction
in total time with the patch below.  Other workloads with a lot of
activate_page() calls also benefit.

I tested some microbenchmarks:
case-anon-cow-rand-mt 0.58%
case-anon-cow-rand -3.30%
case-anon-cow-seq-mt -0.51%
case-anon-cow-seq -5.68%
case-anon-r-rand-mt 0.23%
case-anon-r-rand 0.81%
case-anon-r-seq-mt -0.71%
case-anon-r-seq -1.99%
case-anon-rx-rand-mt 2.11%
case-anon-rx-seq-mt 3.46%
case-anon-w-rand-mt -0.03%
case-anon-w-rand -0.50%
case-anon-w-seq-mt -1.08%
case-anon-w-seq -0.12%
case-anon-wx-rand-mt -5.02%
case-anon-wx-seq-mt -1.43%
case-fork 1.65%
case-fork-sleep -0.07%
case-fork-withmem 1.39%
case-hugetlb -0.59%
case-lru-file-mmap-read-mt -0.54%
case-lru-file-mmap-read 0.61%
case-lru-file-mmap-read-rand -2.24%
case-lru-file-readonce -0.64%
case-lru-file-readtwice -11.69%
case-lru-memcg -1.35%
case-mmap-pread-rand-mt 1.88%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq-mt 0.89%
case-mmap-pread-seq -69.72%
case-mmap-xread-rand-mt 0.71%
case-mmap-xread-seq-mt 0.38%

The most significant are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%

which use activate_page() a lot.  The others are basically run-to-run
variation, because each run differs slightly.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm: simplify code of swap.c
Shaohua Li [Thu, 13 Jan 2011 23:47:33 +0000 (15:47 -0800)]
mm: simplify code of swap.c

Clean up code and remove duplicate code.  Next patch will use
pagevec_lru_move_fn introduced here too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm/page_alloc.c: don't cache `current' in a local
Andrew Morton [Thu, 13 Jan 2011 23:47:32 +0000 (15:47 -0800)]
mm/page_alloc.c: don't cache `current' in a local

It's old-fashioned and unneeded.

akpm:/usr/src/25> size mm/page_alloc.o
   text    data     bss     dec     hex filename
  39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
  39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm: fix hugepage migration
Hugh Dickins [Thu, 13 Jan 2011 23:47:31 +0000 (15:47 -0800)]
mm: fix hugepage migration

2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
but its anon_vma handling was still based around the 2.6.35 conventions.
Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
drop_anon_vma in the same way as we're now changing unmap_and_move().

I don't particularly like to propose this for stable when I've not seen
its problems in practice nor tested the solution: but it's clearly out of
synch at present.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm: fix migration hangs on anon_vma lock
Hugh Dickins [Thu, 13 Jan 2011 23:47:30 +0000 (15:47 -0800)]
mm: fix migration hangs on anon_vma lock

Increased usage of page migration in mmotm reveals that the anon_vma
locking in unmap_and_move() has been deficient since 2.6.36 (or even
earlier).  Review at the time of f18194275c39835cb84563500995e0d503a32d9a
("mm: fix hang on anon_vma->root->lock") missed the issue here: the
anon_vma to which we get a reference may already have been freed back to
its slab (it is in use when we check page_mapped, but that can change),
and so its anon_vma->root may be switched at any moment by reuse in
anon_vma_prepare.

Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
then we don't need any rcu read locking over here.

In removing the rcu_unlock label: since PageAnon is a bit in
page->mapping, it's impossible for a !page->mapping page to be anon; but
insert VM_BUG_ON in case the implementation ever changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago ksm: drain pagevecs to lru
Hugh Dickins [Thu, 13 Jan 2011 23:47:29 +0000 (15:47 -0800)]
ksm: drain pagevecs to lru

It was hard to explain the page counts which were causing new LTP tests
of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago hugetlb: fix handling of parse errors in sysfs
Eric B Munson [Thu, 13 Jan 2011 23:47:28 +0000 (15:47 -0800)]
hugetlb: fix handling of parse errors in sysfs

When parsing changes to the huge page pool sizes made from userspace via
the sysfs interface, bogus input values are being covered up by
nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
when strict_strtoul returns an error.  This can cause an infinite loop in
the nr_hugepages_store code.  This patch changes the return value for
these functions to -EINVAL when strict_strtoul returns an error.
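
A userspace sketch of the corrected behaviour is shown below.  It uses
strtoul() in place of strict_strtoul(), and demo_nr_store() is a hypothetical
handler, so treat it as an illustration of the return-value convention only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical store handler.  On success it stores the parsed value and
 * returns the number of bytes consumed; on bad input it returns -EINVAL
 * instead of 0, so the caller sees an error rather than "nothing written".
 */
long demo_nr_store(const char *buf, size_t len, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(buf != NULL && out != NULL);

	errno = 0;
	val = strtoul(buf, &end, 10);
	/*
	 * Reject empty input, overflow, and anything other than
	 * end-of-string or a newline right after the number.
	 */
	if (errno != 0 || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;

	*out = val;
	return (long)len;
}
```

A handler that returns 0 for bad input reports that no bytes were consumed,
which can make the writer retry the same data indefinitely; a negative error
code ends the write.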

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago hugetlb: do not allow pagesize >= MAX_ORDER pool adjustment
Eric B Munson [Thu, 13 Jan 2011 23:47:27 +0000 (15:47 -0800)]
hugetlb: do not allow pagesize >= MAX_ORDER pool adjustment

Huge pages with order >= MAX_ORDER must be allocated at boot via the
kernel command line; they cannot be allocated or freed once the kernel is
up and running.  Currently we allow values to be written to the sysfs and
sysctl files controlling pool size for these huge page sizes.  This patch
makes the store functions for nr_hugepages and nr_overcommit_hugepages
return -EINVAL when the pool for a page size >= MAX_ORDER is changed.

[akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
[caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago hugetlb: check the return value of string conversion in sysctl handler
Michal Hocko [Thu, 13 Jan 2011 23:47:26 +0000 (15:47 -0800)]
hugetlb: check the return value of string conversion in sysctl handler

proc_doulongvec_minmax may fail if the given buffer doesn't represent a
valid number.  If we provide something invalid we will initialize the
resulting value (nr_overcommit_huge_pages in this case) to a random value
from the stack.

The issue was introduced by a3d0c6aa, when the default handler was
replaced by the helper function, where we do not check the return value.

Reproducer:
echo "" > /proc/sys/vm/nr_overcommit_hugepages

[akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc
Stefan Hajnoczi [Thu, 13 Jan 2011 23:47:26 +0000 (15:47 -0800)]
fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc

The sync_inodes_sb() function does not have a return value.  Remove the
outdated documentation comment.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc()
Andrew Morton [Thu, 13 Jan 2011 23:47:25 +0000 (15:47 -0800)]
mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc()

As it stands this code will degenerate into a busy-wait if the calling task
has signal_pending().

Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm/dmapool.c: take lock only once in dma_pool_free()
Rolf Eike Beer [Thu, 13 Jan 2011 23:47:24 +0000 (15:47 -0800)]
mm/dmapool.c: take lock only once in dma_pool_free()

dma_pool_free() scans for the page to free in the pool list holding the
pool lock.  Then it releases the lock basically to acquire it immediately
again.  Modify the code to only take the lock once.

This will do some additional loops and computations with the lock held
if memory debugging is activated.  If it is not activated, the only new
operations with this lock held are one if and one subtraction.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists
KyongHo Cho [Thu, 13 Jan 2011 23:47:24 +0000 (15:47 -0800)]
mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists

The previous approach to calculating the combined index was

page_idx & ~(1 << order)

but we have same result with

page_idx & buddy_idx

This reduces instructions slightly as well as enhances readability.
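
The equivalence can be checked with a small standalone program.  The sketch
assumes the buddy index is obtained by flipping the bit for the given order
(page_idx ^ (1 << order)); the demo_* function names are hypothetical.

```c
#include <assert.h>

/* Assumption: the buddy index flips the bit that corresponds to the order. */
unsigned long demo_buddy_idx(unsigned long page_idx, unsigned int order)
{
	assert(order < sizeof(unsigned long) * 8);
	return page_idx ^ (1UL << order);
}

/* Previous form: clear the order bit explicitly. */
unsigned long demo_combined_old(unsigned long page_idx, unsigned int order)
{
	return page_idx & ~(1UL << order);
}

/*
 * New form: AND with the buddy index.  The two indexes differ only in the
 * order bit, so that bit becomes 0 and every other bit is kept.
 */
unsigned long demo_combined_new(unsigned long page_idx, unsigned int order)
{
	return page_idx & demo_buddy_idx(page_idx, order);
}
```

Because the buddy index is already computed at that point, the second form
saves recomputing the mask.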

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix used-uninitialised warning]
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago brk: fix min_brk lower bound computation for COMPAT_BRK
Jiri Kosina [Thu, 13 Jan 2011 23:47:23 +0000 (15:47 -0800)]
brk: fix min_brk lower bound computation for COMPAT_BRK

Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
be overridden by randomize_va_space sysctl.

If this is the case, the min_brk computation in sys_brk() implementation
is wrong, as it solely takes into account COMPAT_BRK setting, assuming
that brk start is not randomized.  But that might not be the case if
randomize_va_space sysctl has been set to '2' at the time the binary has
been loaded from disk.

In such a case, the check has to be done in the same way as in the
!CONFIG_COMPAT_BRK case.

In addition to that, the check for the COMPAT_BRK case introduced back in
a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
but mm->end_data instead, as that's where the legacy applications expect
brk section to start (i.e.  immediately after last global variable).
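
The resulting choice of lower bound might be sketched as below.  This is an
illustration, not the kernel source: the struct, the two boolean parameters,
and the assumption that the !CONFIG_COMPAT_BRK case uses mm->start_brk are
all mine.

```c
#include <stdbool.h>

/* Hypothetical subset of the mm fields mentioned above. */
struct demo_mm {
	unsigned long end_code;
	unsigned long end_data;
	unsigned long start_brk;
};

/*
 * compat_brk:     COMPAT_BRK is configured.
 * brk_randomized: the brk start was randomized when the binary was loaded.
 */
unsigned long demo_min_brk(const struct demo_mm *mm, bool compat_brk,
			   bool brk_randomized)
{
	/* Legacy layout: brk starts right after the data section. */
	if (compat_brk && !brk_randomized)
		return mm->end_data;

	/* Otherwise behave as in the non-COMPAT_BRK case (assumed start_brk). */
	return mm->start_brk;
}
```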

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common()
Jesper Juhl [Thu, 13 Jan 2011 23:47:22 +0000 (15:47 -0800)]
mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common()

The NODEMASK_ALLOC macro may dynamically allocate memory for its second
argument ('nodes_allowed' in this context).

In nr_hugepages_store_common() we may abort early if strict_strtoul()
fails, but in that case we do not free the memory already allocated to
'nodes_allowed', causing a memory leak.

This patch closes the leak by freeing the memory in the error path.

[akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot...
Mel Gorman [Thu, 13 Jan 2011 23:47:21 +0000 (15:47 -0800)]
mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration

migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
anonymous pages, as introduced by git commit
989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
migraton").  The point of the RCU protection there is part of getting a
stable reference to anon_vma and is only held for anon pages as file pages
are locked which is sufficient protection against freeing.

However, while a file page's mapping is being migrated, the radix tree is
double checked to ensure it is the expected page.  This uses
radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
triggering the following warning.

[  173.674290] ===================================================
[  173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
[  173.676016] ---------------------------------------------------
[  173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[  173.676016]
[  173.676016] other info that might help us debug this:
[  173.676016]
[  173.676016]
[  173.676016] rcu_scheduler_active = 1, debug_locks = 0
[  173.676016] 1 lock held by hugeadm/2899:
[  173.676016]  #0:  (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
[  173.676016]
[  173.676016] stack backtrace:
[  173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
[  173.676016] Call Trace:
[  173.676016]  [<c128cc01>] ? printk+0x14/0x1b
[  173.676016]  [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
[  173.676016]  [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
[  173.676016]  [<c10e41ad>] migrate_page+0x23/0x39
[  173.676016]  [<c10e491b>] buffer_migrate_page+0x22/0x107
[  173.676016]  [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
[  173.676016]  [<c10e425d>] move_to_new_page+0x9a/0x1ae
[  173.676016]  [<c10e47e6>] migrate_pages+0x1e7/0x2fa

This patch introduces radix_tree_deref_slot_protected() which calls
rcu_dereference_protected().  Users of it must pass in the
mapping->tree_lock that is protecting this dereference.  Holding the tree
lock protects against parallel updaters of the radix tree meaning that
rcu_dereference_protected is allowable.

[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org> [2.6.37.early]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: add compound_trans_head() helper
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:20 +0000 (15:47 -0800)]
thp: add compound_trans_head() helper

Clean up some code with a common compound_trans_head helper.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: KSM on THP
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:19 +0000 (15:47 -0800)]
thp: KSM on THP

This makes KSM fully operational with THP pages.  Subpages are scanned
while the hugepage is still in place and delivering max cpu performance,
and only if there's a match and we're going to deduplicate memory is the
single hugepage with the subpage match split.

There will be no false sharing between ksmd and khugepaged.  khugepaged
won't collapse 2m virtual regions with KSM pages inside.  ksmd also should
only split pages when the checksum matches and we're likely to split a
hugepage for some long-lived ksm page (usual ksm heuristic to avoid
sharing pages that get de-cowed).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: khugepaged: make khugepaged aware about madvise
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:18 +0000 (15:47 -0800)]
thp: khugepaged: make khugepaged aware about madvise

MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
mmap and before touching the memory.  While this is enough for most
usages, it takes little effort to make madvise more dynamic at runtime on
an existing mapping by making khugepaged aware of madvise.

MADV_HUGEPAGE: register in khugepaged immediately without waiting for a
page fault (that may never happen if all pages are already mapped and the
"enabled" knob was set to madvise during the initial page faults).

MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
collapsing pages where not needed.
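
As a rough user-space illustration of the case described above (advice
applied to a mapping whose pages have already been touched), the sketch
below maps anonymous memory, writes to it, and only then calls madvise().
The helper name advise_touched_region() is hypothetical and not part of
this patch; whether the call succeeds depends on the running kernel's
configuration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): map `len` bytes of anonymous
 * memory, touch every page, then apply `advice` (for example
 * MADV_HUGEPAGE or MADV_NOHUGEPAGE) to the already-populated mapping.
 * Returns 0 on success, or the errno value of the failing call.
 */
static int advise_touched_region(size_t len, int advice)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    /* Fault the pages in first, so the advice arrives "late". */
    memset(p, 0xaa, len);

    /* Capture errno before munmap() can overwrite it. */
    int rc = (madvise(p, len, advice) == 0) ? 0 : errno;

    munmap(p, len);
    return rc;
}
```

A caller might pass a 4 MiB length with MADV_HUGEPAGE and treat a return
value of EINVAL as "not supported on this kernel" rather than as a fatal
error.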

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: madvise(MADV_NOHUGEPAGE)
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:17 +0000 (15:47 -0800)]
thp: madvise(MADV_NOHUGEPAGE)

Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
hugepage backed.  Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel.  Never silently return
success.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: mm: define MADV_NOHUGEPAGE
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:17 +0000 (15:47 -0800)]
thp: mm: define MADV_NOHUGEPAGE

Define MADV_NOHUGEPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: compound_trans_order
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:16 +0000 (15:47 -0800)]
thp: compound_trans_order

Read the compound order safely via compound_trans_order().  No-op for
CONFIG_TRANSPARENT_HUGEPAGE=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: fix memory-failure hugetlbfs vs THP collision
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:16 +0000 (15:47 -0800)]
thp: fix memory-failure hugetlbfs vs THP collision

hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
pages, and that broke THP as split_huge_page was then called on hugetlbfs
pages too.

compound_head/order was also run unsafely on THP pages that can be split
at any time.

All compound_head() invocations in memory-failure.c that are run on pages
that aren't pinned and that can be freed and reused from under us (while
compound_head is running) are buggy because compound_head can return a
dangling pointer.  I'm not fixing this here: it is a generic
memory-failure bug that is not specific to THP and applies to hugetlbfs
too, so I can fix it later after THP is merged upstream.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: add debug checks for mapcount related invariants
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:15 +0000 (15:47 -0800)]
thp: add debug checks for mapcount related invariants

Add debug checks for invariants that, if broken, could cause the mapcount
vs page_mapcount debug checks to trigger later in split_huge_page.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: transparent hugepage sysfs meminfo
David Rientjes [Thu, 13 Jan 2011 23:47:14 +0000 (15:47 -0800)]
thp: transparent hugepage sysfs meminfo

Add hugepage statistics to per-node sysfs meminfo.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: scale nr_rotated to balance memory pressure
Rik van Riel [Thu, 13 Jan 2011 23:47:13 +0000 (15:47 -0800)]
thp: scale nr_rotated to balance memory pressure

Make sure we scale up nr_rotated when we encounter a referenced
transparent huge page.  This ensures pageout scanning balance is not
distorted when there are huge pages on the LRU.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: fix anon memory statistics with transparent hugepages
Rik van Riel [Thu, 13 Jan 2011 23:47:13 +0000 (15:47 -0800)]
thp: fix anon memory statistics with transparent hugepages

Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.
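
The accounting rule above comes down to simple arithmetic. The sketch
below is an illustration, not the kernel code: the helper names and the
idea of passing the page shifts as parameters are assumptions. With
4 KiB base pages (shift 12) and 2 MiB PMD-sized huge pages (shift 21),
one huge page counts as 512 base pages.

```c
#include <assert.h>

/*
 * Number of base pages covered by one PMD-sized huge page:
 * 1 << (pmd_shift - page_shift).  Hypothetical helper mirroring what
 * HPAGE_PMD_NR expresses.
 */
static unsigned long hpage_pmd_nr(unsigned page_shift, unsigned pmd_shift)
{
    assert(pmd_shift >= page_shift);
    return 1UL << (pmd_shift - page_shift);
}

/*
 * LRU page count for a mix of small pages and transparent huge pages,
 * with each huge page weighted by the number of base pages it spans.
 */
static unsigned long lru_pages(unsigned long nr_small, unsigned long nr_huge,
                               unsigned page_shift, unsigned pmd_shift)
{
    return nr_small + nr_huge * hpage_pmd_nr(page_shift, pmd_shift);
}
```

For example, 10 small pages plus 3 huge pages would be reported as
10 + 3 * 512 = 1546 pages under these assumptions.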

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: disable transparent hugepages by default on small systems
Rik van Riel [Thu, 13 Jan 2011 23:47:12 +0000 (15:47 -0800)]
thp: disable transparent hugepages by default on small systems

On small systems, the extra memory used by the anti-fragmentation memory
reserve, and simply by huge pages being larger than small pages, can
easily outweigh the benefits of fewer TLB misses.

A less obvious concern arises on a NUMA machine with asymmetric node
sizes where one of the nodes is very small.  The reserve could make that
node unusable.

In case of the crashdump kernel, OOMs have been observed due to the
anti-fragmentation memory reserve taking up a large fraction of the
crashdump image.

This patch disables transparent hugepages on systems with less than 1GB of
RAM, but the hugepage subsystem is fully initialized so administrators can
enable THP through /sys if desired.
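
The default described above can be pictured as a single threshold check.
This is a minimal sketch that follows the 1GB figure stated in this
message; the function name, its parameters, and the exact comparison are
assumptions for illustration and are not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical check: should THP be enabled by default on a system with
 * `total_pages` pages of RAM, each 1 << page_shift bytes in size?
 * Assumes the 1 GiB threshold named in the commit message.
 */
static bool thp_default_enabled(unsigned long total_pages, unsigned page_shift)
{
    assert(page_shift <= 30);

    /* 1 GiB expressed in pages: 2^30 bytes / 2^page_shift bytes per page. */
    unsigned long threshold_pages = 1UL << (30 - page_shift);

    return total_pages >= threshold_pages;
}
```

With 4 KiB pages the threshold works out to 262144 pages, so a 512 MiB
machine (131072 pages) would start with THP disabled, while the
administrator could still turn it on afterwards.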

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: use compaction for all allocation orders
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:11 +0000 (15:47 -0800)]
thp: use compaction for all allocation orders

It makes no sense not to enable compaction for small order pages as we
don't want to end up with bad order 2 allocations and good and graceful
order 9 allocations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: use compaction in kswapd for GFP_ATOMIC order > 0
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:11 +0000 (15:47 -0800)]
thp: use compaction in kswapd for GFP_ATOMIC order > 0

This takes advantage of memory compaction to properly generate pages of
order > 0 if regular page reclaim fails and priority level becomes more
severe and we don't reach the proper watermarks.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: freeze khugepaged and ksmd
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:10 +0000 (15:47 -0800)]
thp: freeze khugepaged and ksmd

It's unclear why scheduler-friendly kernel threads can't simply be taken
off the CPU by the scheduler itself.  It's safer to freeze them because
they can trigger memory allocation; if kswapd also freezes itself to avoid
generating I/O, they have to as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: mmu_notifier_test_young
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:10 +0000 (15:47 -0800)]
thp: mmu_notifier_test_young

For GRU and EPT, we need gup-fast to set the referenced bit too (this is
why it's correct to return 0 when shadow_access_mask is zero: it requires
gup-fast to set the referenced bit).  qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy; if it's zero-copy or a shadow
paging EPT minor fault, we rely on gup-fast to signal that the page is in
use.

We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.

Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.

->test_young is fully backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs, just like
EPT does.

Completely removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page probably wouldn't be so bad either, but I
thought it was worth keeping, and this makes it reliable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
13 years ago thp: don't allow transparent hugepage support without PSE
Andrea Arcangeli [Thu, 13 Jan 2011 23:47:09 +0000 (15:47 -0800)]
thp: don't allow transparent hugepage support without PSE

Archs implementing Transparent Hugepage Support must implement a function
called has_transparent_hugepage to be sure the virtual or physical CPU
supports Transparent Hugepages.
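
From user space, a similar question can be approximated by looking for
the "pse" flag in the flags line of /proc/cpuinfo on x86. The sketch
below is only an analogy to the in-kernel has_transparent_hugepage()
check, not its implementation; the helper name flags_contain() is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: report whether `name` appears as a whole
 * whitespace-delimited token in a CPU flags line, so that "pse" does
 * not match "pse36".
 */
static bool flags_contain(const char *flags, const char *name)
{
    assert(flags != NULL && name != NULL && *name != '\0');

    size_t n = strlen(name);
    const char *p = flags;

    while ((p = strstr(p, name)) != NULL) {
        /* The token must start at the beginning or after a separator. */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t' ||
                      p[-1] == ':';
        /* The token must end at the end of the line or before a separator. */
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' ||
                    p[n] == '\n';
        if (starts && ends)
            return true;
        p += n;
    }
    return false;
}
```

A caller would read the flags line from /proc/cpuinfo and pass it in
together with "pse"; a virtual CPU that hides the flag would then be
reported as lacking support.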

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>