review.tizen.org Git - platform/kernel/linux-rpi.git/log

Merge git://git.infradead.org/battery-2.6

* git://git.infradead.org/battery-2.6:
  intel_mid_battery: Fix battery scaling
  intel_mid_battery: Fix the argument order to intel_scu_ipc_command
  olpc_battery: Fix build failure caused by sysfs changes
  Add s3c-adc-battery driver
  Intel MID platform battery driver

Fix up trivial conflicts (battery drivers added from different branches)
in drivers/power/{Kconfig,Makefile}

x86/hpet: Use the FSEC_PER_SEC constant for femto-second periods

The current computation, introduced with f12a15be63, of FSEC_PER_SEC using
the multiplication of (FSEC_PER_NSEC * NSEC_PER_SEC) is performed only
with 32bit integers on small machines, resulting in an overflow and a
*very* short intervals being programmed. An interrupt storm follows.

Note that we also have to specify FSEC_PER_SEC as being long long to
overcome the same limitations.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

timekeeping: Fix overflow in rawtime tv_nsec on 32 bit archs

The tv_nsec is a long and when added to the shifted interval it can wrap
and become negative which later causes looping problems in the
getrawmonotonic(). The edge case occurs when the system has slept for
a short period of time of ~2 seconds.

A trace printk of the values in this patch illustrate the problem:

ftrace time stamp: log
43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa
43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd
43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0
46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec e1f9ae3
46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3

The kernel starts looping at 46.349925 in the getrawmonotonic() due to
the negative value from adding the raw value to tv_nsec.

A simple solution is to accumulate into a u64, and then normalize it
to a timespec_t.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
[ Reworked variable names and simplified some of the code. - John ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MN10300: Use no_printk() for disabled gdbstub debugging functions

Use no_printk() for disabled gdbstub debugging functions to maintain side
effect checking.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Add a dummy printk function for the maintenance of unused printks

Add a dummy printk function for the maintenance of unused printks through gcc
format checking, and also so that side-effect checking is maintained too.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MN10300: Don't try and #include <linux/slab.h> in lib/inflate.c from bootloader

Don't try and #include <linux/slab.h> in lib/inflate.c from the bootloader code
as linux/slab.h hauls in function defs that aren't available in the bootloader
code and may also haul in conflicting functions.

To fix this, make the inclusion of linux/slab.h contingent on NO_INFLATE_MALLOC
as are the usages of kmalloc() and kfree().

In MN10300, this causes the following errors:

In file included from include/linux/string.h:21,
                 from include/linux/bitmap.h:8,
                 from include/linux/nodemask.h:93,
                 from include/linux/mmzone.h:16,
                 from include/linux/gfp.h:4,
                 from include/linux/slab.h:12,
                 from arch/mn10300/boot/compressed/../../../../lib/inflate.c:106,
                 from arch/mn10300/boot/compressed/misc.c:170:
/warthog/am33/linux-2.6-mn10300/arch/mn10300/include/asm/string.h:19: error: conflicting types for 'memset'
arch/mn10300/boot/compressed/misc.c:59: error: previous definition of 'memset' was here

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MN10300: Permit .GCC-command-line sections

Permit .GCC-command-line sections in modules. Otherwise modpost says things
like:

WARNING: drivers/mtd/chips/map_ram.o (.GCC-command-line): unexpected non-allocatable section.
Did you forget to use "ax"/"aw" in a .S file?
Note that for example <linux/init.h> contains
section definitions for use in .S files.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MN10300: Fix size_t and ssize_t

With the newer compilers, size_t and ssize_t are expected to be (un)signed int
rather than (un)signed long.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MN10300: Fix RTC routines

A change to the RTC routines in the MN10300 arch used set_rtc_mms() when it
meant set_rtc_mmss(). This results in an error due to a reference of an
undefined symbol.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'release' of git://git./linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] Fix rwsem: RWSEM_WAITING_BIAS must not be unsigned.

Merge branch 'drm-core-next' of git://git./linux/kernel/git/airlied/drm-2.6

* 'drm-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (55 commits)
  io-mapping: move asm include inside the config option
  vgaarb: drop vga.h include
  drm/radeon: Add probing of clocks from device-tree
  drm/radeon: drop old and broken mesa warning
  drm/radeon: Fix pci_map_page() error checking
  drm: Remove count_lock for calling lastclose() after 58474713 (v2)
  drm/radeon/kms: allow FG_ALPHA_VALUE on r5xx
  drm/radeon/kms: another r6xx/r7xx CS checker fix
  DRM: Replace kmalloc/memset combos with kzalloc
  drm: expand gamma_set
  drm/edid: Split mode lists out to their own header for readability
  drm/edid: Rewrite mode parse to use the generic detailed block walk
  drm/edid: Add detailed block walk for VTB extensions
  drm/edid: Add detailed block walk for CEA extensions
  drm: Remove unused fields from drm_display_info
  drm: Use ENOENT consistently for the error return for an unmatched handle.
  drm/radeon/kms: mark 3D power states as performance
  drm: Only set DPMS once on the CRTC not after every encoder.
  drm/radeon/kms: add additional quirk for Acer rv620 laptop
  drm: Propagate error code from fb_create()
  ...

Fix up trivial conflicts in drivers/gpu/drm/drm_edid.c

[IA64] Fix rwsem: RWSEM_WAITING_BIAS must not be unsigned.

Some nice improvements were made to rwsem in commit:

424acaaeb3a3932d64a9b4bd59df6cf72c22d8f3
rwsem: wake queued readers when writer blocks on active read lock

but this change overlooked that ia64 had defined RWSEM_WAITING_BIAS
as an unsigned value, while the new code required a signed value (as
it is in every other architecture).

This fix suggested by the original patch author: Michel Lespinasse.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6

* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6:
  mmc_spi: Fix unterminated of_match_table
  of/sparc: fix build regression from of_device changes
  of/device: Replace struct of_device with struct platform_device

Merge branch 'stable/xen-swiotlb-0.8.6' of git://git./linux/kernel/git/konrad/xen

* 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  x86: Detect whether we should use Xen SWIOTLB.
  pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
  swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
  xen/mmu: inhibit vmap aliases rather than trying to clear them out
  vmap: add flag to allow lazy unmap to be disabled at runtime
  xen: Add xen_create_contiguous_region
  xen: Rename the balloon lock
  xen: Allow unprivileged Xen domains to create iomap pages
  xen: use _PAGE_IOMAP in ioremap to do machine mappings

Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
include/xen/xen-ops.h

memstick: fix hangs on unexpected device removal in mspro_blk

mspro_block_remove() is called from detect thread that first calls the
mspro_block_stop(), which stops the request queue. If we call
del_gendisk() with the queue stopped we get a deadlock.

Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memstick: init sysfs attributes

Otherwise lockdep complains.

Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmc_test: fix large memory allocation

- Fix mmc_test_alloc_mem.

- Use nr_free_buffer_pages() instead of sysinfo.totalram to determine
  total lowmem pages.

- Change variables containing memory sizes to unsigned long.

- Limit maximum test area size to 128MiB because that is the maximum MMC
  high capacity erase size (the maxmium SD allocation unit size is just
  4MiB)

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmc_test: add performance tests

mmc_test provides tests aimed at testing SD/MMC hosts. This patch adds
performance tests.

It is advantageous to have performance tests in a kernel
module like mmc_test for the following reasons:
- transfer times can be measured very accurately
- arbitrarily large transfers are possible
- the effect of contiguous vs scattered pages
can be determined

The new tests are:

23. Best-case read performance
24. Best-case write performance
25. Best-case read performance into scattered pages
26. Best-case write performance from scattered pages
27. Single read performance by transfer size
28. Single write performance by transfer size
29. Single trim performance by transfer size
30. Consecutive read performance by transfer size
31. Consecutive write performance by transfer size
32. Consecutive trim performance by transfer size

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmc_block: add support for secure discard

Secure discard is implemented by Secure Trim if the discard is unaligned
or Secure Erase otherwise.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

block: add secure discard

Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

omap_hsmmc: add erase capability

Disable the data (busy) timeout for erases and set the MMC_CAP_ERASE
capability.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmc_block: add discard support

Enable MMC to service discard requests. In the case of SD and MMC cards
that do not support trim, discards become erases. In the case of cards
(MMC) that only allow erases in multiples of erase group size, round to
the nearest completely discarded erase group.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmc: add erase, secure erase, trim and secure trim operations

SD/MMC cards tend to support an erase operation.  In addition, eMMC v4.4
cards can support secure erase, trim and secure trim operations that are
all variants of the basic erase command.

SD/MMC device attributes "erase_size" and "preferred_erase_size" have been
added.

"erase_size" is the minimum size, in bytes, of an erase operation.  For
MMC, "erase_size" is the erase group size reported by the card.  Note that
"erase_size" does not apply to trim or secure trim operations where the
minimum size is always one 512 byte sector.  For SD, "erase_size" is 512
if the card is block-addressed, 0 otherwise.

SD/MMC cards can erase an arbitrarily large area up to and
including the whole card.  When erasing a large area it may
be desirable to do it in smaller chunks for three reasons:

    1. A single erase command will make all other I/O on the card
       wait.  This is not a problem if the whole card is being erased, but
       erasing one partition will make I/O for another partition on the
       same card wait for the duration of the erase - which could be a
       several minutes.

    2. To be able to inform the user of erase progress.

    3. The erase timeout becomes too large to be very useful.
       Because the erase timeout contains a margin which is multiplied by
       the size of the erase area, the value can end up being several
       minutes for large areas.

"erase_size" is not the most efficient unit to erase (especially for SD
where it is just one sector), hence "preferred_erase_size" provides a good
chunk size for erasing large areas.

For MMC, "preferred_erase_size" is the high-capacity erase size if a card
specifies one, otherwise it is based on the capacity of the card.

For SD, "preferred_erase_size" is the allocation unit size specified by
the card.

"preferred_erase_size" is in bytes.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix writeback_in_progress()

Commit 83ba7b071f3 ("writeback: simplify the write back thread queue")
broke writeback_in_progress() as in that commit we started to remove work
items from the list at the moment we start working on them and not at the
moment they are finished.  Thus if the flusher thread was doing some work
but there was no other work queued, writeback_in_progress() returned
false.  This could in particular cause unnecessary queueing of background
writeback from balance_dirty_pages() or writeout work from
writeback_sb_if_idle().

This patch fixes the problem by introducing a bit in the bdi state which
indicates that the flusher thread is processing some work and uses this
bit for writeback_in_progress() test.

NOTE: Both callsites of writeback_in_progress() (namely,
writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
need a different information than what writeback_in_progress() provides.
They would need to know whether *the kind of writeback they are going to
submit* is already queued.  But this information isn't that simple to
provide so let's fix writeback_in_progress() for the time being.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: merge for_kupdate and !for_kupdate cases

Unify the logic for kupdate and non-kupdate cases.  There won't be
starvation because the inodes requeued into b_more_io will later be
spliced _after_ the remaining inodes in b_io, hence won't stand in the way
of other inodes in the next run.

It avoids unnecessary redirty_tail() calls, hence the update of
i_dirtied_when.  The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or may exclude the inode from
the data integrity sync operation (which checks timestamp to avoid extra
work and livelock).

===
How the redirty_tail() comes about:

It was a long story..  This redirty_tail() was introduced with
wbc.more_io.  The initial patch for more_io actually does not have the
redirty_tail(), and when it's merged, several 100% iowait bug reports
arised:

reiserfs:
        http://lkml.org/lkml/2007/10/23/93

jfs:
        commit 29a424f28390752a4ca2349633aaacc6be494db5
        JFS: clear PAGECACHE_TAG_DIRTY for no-write pages

ext2:
        http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

They are all old bugs hidden in various filesystems that become "visible"
with the more_io patch.  At the time, the ext2 bug is thought to be
"trivial", so not fixed.  Instead the following updated more_io patch with
redirty_tail() is merged:

http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

This will in general prevent 100% on ext2 and possibly other unknown FS bugs.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: fix queue_io() ordering

This was not a bug, since b_io is empty for kupdate writeback. The next
patch will do requeue_io() for non-kupdate writeback, so let's fix it.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: don't redirty tail an inode with dirty pages

Avoid delaying writeback for an expire inode with lots of dirty pages, but
no active dirtier at the moment. Previously we only do that for the
kupdate case.

Any filesystem that does delayed allocation or unwritten extent conversion
after IO completion will cause this - for example, XFS.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: add comment to the dirty limit functions

Document global_dirty_limits() and bdi_dirty_limit().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: avoid unnecessary calculation of bdi dirty thresholds

Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: balance_dirty_pages(): reduce calls to global_page_state

Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.

'perf stats' of simple fio write tests shows the reduction in cache
access.  Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2
with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10
times, dropping the fasted & slowest values then taking the average &
standard deviation

average (s.d.) in millions (10^6)
2.6.31-rc8 648.6 (14.6)
+patch 620.1 (16.5)

Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads
the counters to apply the dirty_threshold and moving this check up into
balance_dirty_pages where it has already read the counters.

Also by rearrange the for loop to only contain one copy of the limit tests
allows the pdflush test after the loop to use the local copies of the
counters rather than rereading them.

In the common case with no throttling it now calls global_page_state 5
fewer times and bdi_stat 2 fewer.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
avoid exceeding the dirty limit.  Since the bdi dirty limit is mostly
accurate we don't need to do routinely clip.  A simple dirty limit check
would be enough.

The check is necessary because, in principle we should throttle everything
calling balance_dirty_pages() when we're over the total limit, as said by
Peter.

We now set and clear dirty_exceeded not only based on bdi dirty limits,
but also on the global dirty limit.  The global limit check is added in
place of clip_bdi_dirty_limit() for safety and not intended as a behavior
change.  The bdi limits should be tight enough to keep all dirty pages
under the global limit at most time; occasional small exceeding should be
OK though.  The change makes the logic more obvious: the global limit is
the ultimate goal and shall be always imposed.

We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and have to) double check
the states.  It reduces overall overheads because the test based on old
states still have good chance to be right.

[akpm@linux-foundation.org] fix uninitialized dirty_exceeded
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

parisc: fix wrong page aligned size calculation in ioremapping code

parisc __ioremap(): fix off-by-one error in page alignment of allocation
size for sizes where size%PAGE_SIZE==1.

Signed-off-by: Florian Zumbiehl <florz@florz.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

score: fix dereference of NULL pointer in local_flush_tlb_page()

Don't dereference vma if it's NULL.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pc8736x_gpio: depends on X86_32

Fix kconfig dependency warning for PC8736x_GPIO by restricting it to
X86_32.

warning: (SCx200_GPIO && SCx200 || PC8736x_GPIO && X86) selects NSC_GPIO which has unmet direct dependencies (X86_32)

NSC_GPIO is X86_32 only. The other driver (SCx200_GPIO) that selects
NSC_GPIO is X86_32 only (indirectly, since SCx200 depends on X86_32), so
limit this driver also.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Cc: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix fatal kernel-doc error

Fix a fatal kernel-doc error due to a #define coming between a function's
kernel-doc notation and the function signature. (kernel-doc cannot handle
this)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

acpi: fix bogus preemption logic

The ACPI_PREEMPTION_POINT() logic was introduced in commit 8bd108d
(ACPICA: add preemption point after each opcode parse). The follow up
commits abe1dfab6, 138d15692, c084ca70 tried to fix the preemption logic
back and forth, but nobody noticed that the usage of
in_atomic_preempt_off() in that context is wrong.

The check which guards the call of cond_resched() is:

if (!in_atomic_preempt_off() && !irqs_disabled())

in_atomic_preempt_off() is not intended for general use as the comment
above the macro definition clearly says:

* Check whether we were atomic before we did preempt_disable():
* (used by the scheduler, *after* releasing the kernel lock)

On a CONFIG_PREEMPT=n kernel the usage of in_atomic_preempt_off() works by
accident, but with CONFIG_PREEMPT=y it's just broken.

The whole purpose of the ACPI_PREEMPTION_POINT() is to reduce the latency
on a CONFIG_PREEMPT=n kernel, so make ACPI_PREEMPTION_POINT() depend on
CONFIG_PREEMPT=n and remove the in_atomic_preempt_off() check.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16210

[akpm@linux-foundation.org: fix build]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Francois Valenduc <francois.valenduc@tvcablenet.be>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kernel/kfifo.c: add handling of chained scatterlists

The current kfifo scatterlist implementation will not work with chained
scatterlists. It assumes that struct scatterlist arrays are allocated
contiguously, which is not the case when chained scatterlists (struct
sg_table) are in use.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

io-mapping: move asm include inside the config option

nouveau starting using these APIs, the first on non-x86 hw, and this
include isn't required on anything with real amounts of vmalloc space.

this fixes a build problem on powerpc.

Signed-off-by: Dave Airlie <airlied@redhat.com>

vgaarb: drop vga.h include

We don't actually need this include on any platform.

built on powerpc + x86, reported on m68k.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

drm/radeon: Add probing of clocks from device-tree

When we find no ROM we understand and a device-tree is present, see
if we can retreive clock info from there.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

drm/radeon: drop old and broken mesa warning

This never really got fixed in mesa, and the kernel deals with the problem
just fine, so don't got reporting things that confuse people.

Signed-off-by: Dave Airlie <airlied@redhat.com>

drm/radeon: Fix pci_map_page() error checking

0 is a valid DMA address from pci_map_page(), use pci_dma_mapping_error()
instead to check for errors

[airlied: fix warning + two other places with errors.]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

drm: Remove count_lock for calling lastclose() after 58474713 (v2)

When removing of the BKL the locking around lastclose() was rearranged
and resulted in the holding of the open_count spinlock over the call
into drm_lastclose(). The drivers were not ready for this path to be
atomic - it may indeed involve long waits to release old objects and
cleanup the GPU - and so we ended up scheduling whilst atomic.

[   54.625598] BUG: scheduling while atomic: X/3546/0x00000002
[   54.625600] Modules linked in: sco bridge stp llc input_polldev rfcomm bnep l2cap crc16 sch_sfq ipv6 md_mod acpi_cpufreq mperf cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod btusb bluetooth usbhid hid zaurus cdc_ether usbnet mii cdc_wdm cdc_acm uvcvideo videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_conexant arc4 pcmcia ecb snd_hda_intel joydev sdhci_pci sdhci snd_hda_codec tpm_tis firewire_ohci mmc_core e1000e uhci_hcd thinkpad_acpi nvram yenta_socket pcmcia_rsrc pcmcia_core tpm wmi sr_mod firewire_core iwlagn ehci_hcd snd_hwdep snd_pcm usbcore tpm_bios thermal led_class snd_timer iwlcore snd soundcore ac snd_page_alloc pcspkr psmouse serio_raw battery sg mac80211 evdev cfg80211 i2c_i801 iTCO_wdt iTCO_vendor_support cdrom processor crc_itu_t rfkill xfs exportfs sd_mod crc_t10dif ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
[   54.625663] Pid: 3546, comm: X Not tainted 2.6.35-04771-g1787985 #301
[   54.625665] Call Trace:
[   54.625671]  [<ffffffff8102d599>] __schedule_bug+0x57/0x5c
[   54.625675]  [<ffffffff81384141>] schedule+0xe5/0x832
[   54.625679]  [<ffffffff81163e77>] ? put_dec+0x20/0x3c
[   54.625682]  [<ffffffff81384dd4>] schedule_timeout+0x275/0x29f
[   54.625686]  [<ffffffff810455e1>] ? process_timeout+0x0/0xb
[   54.625688]  [<ffffffff81384e17>] schedule_timeout_uninterruptible+0x19/0x1b
[   54.625691]  [<ffffffff81045893>] msleep+0x16/0x1d
[   54.625695]  [<ffffffff812a2e53>] i9xx_crtc_dpms+0x273/0x2ae
[   54.625698]  [<ffffffff812a18be>] intel_crtc_dpms+0x28/0xe7
[   54.625702]  [<ffffffff811ec0fa>] drm_helper_disable_unused_functions+0xf0/0x118
[   54.625705]  [<ffffffff811ecde3>] drm_crtc_helper_set_config+0x644/0x7c8
[   54.625708]  [<ffffffff811f12dd>] ? drm_copy_field+0x40/0x50
[   54.625711]  [<ffffffff811ebca2>] drm_fb_helper_force_kernel_mode+0x3e/0x85
[   54.625713]  [<ffffffff811ebcf2>] drm_fb_helper_restore+0x9/0x24
[   54.625717]  [<ffffffff81290a41>] i915_driver_lastclose+0x2b/0x5c
[   54.625720]  [<ffffffff811f14a7>] drm_lastclose+0x44/0x2ad
[   54.625722]  [<ffffffff811f1ed2>] drm_release+0x5c6/0x609
[   54.625726]  [<ffffffff810d1275>] fput+0x109/0x1c7
[   54.625728]  [<ffffffff810ce5e4>] filp_close+0x61/0x6b
[   54.625731]  [<ffffffff810ce680>] sys_close+0x92/0xd4
[   54.625734]  [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

v2: The spinlock is actually superfluous as access to open_count is
entirely serialised by drm_global_mutex and so can be dropped. The
count_lock spinlock instead appears to be used to protect access to
dev->buf_alloc and dev->buf_use.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Dave Airlie <airlied@redhat.com>

drm/radeon/kms: allow FG_ALPHA_VALUE on r5xx

This is a CS checker fix. I need this for FP16 alpha-test.

Signed-off-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

drm/radeon/kms: another r6xx/r7xx CS checker fix

add default case for buffer formats

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Cc: Andre Maasikas <amaasikas@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

DRM: Replace kmalloc/memset combos with kzalloc

Currently most, if not all, memory allocation in drm_bufs.c is followed by initializing the memory with 0.

Replace the use of kmalloc+memset with kzalloc.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  isofs: Fix lseek() to position beyond 4 GB
  vfs: remove unused MNT_STRICTATIME
  vfs: show unreachable paths in getcwd and proc
  vfs: only add " (deleted)" where necessary
  vfs: add prepend_path() helper
  vfs: __d_path: dont prepend the name of the root dentry
  ia64: perfmon: add d_dname method
  vfs: add helpers to get root and pwd
  cachefiles: use path_get instead of lone dget
  fs/sysv/super.c: add support for non-PDP11 v7 filesystems
  V7: Adjust sanity checks for some volumes
  Add v7 alias
  v9fs: fixup for inode_setattr being removed

Manual merge to take Al's version of the fs/sysv/super.c file: it merged
cleanly, but Al had removed an unnecessary header include, so his side
was better.

Merge git://git./linux/kernel/git/pkl/squashfs-linus

* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: fix checkpatch.pl warnings
  Squashfs: fix filename typo
  Squashfs: update Kconfig and documentation for LZO
  Squashfs: fix block size use in LZO decompressor
  Squashfs: Add LZO compression support
  squashfs: fix filename in header comment
  Squashfs: Make XATTR config name consistent with other file systems
  squashfs: fix compiler inline warning

Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Fix groups code when num_devices is not divisible by group_width
  exofs: Remove useless optimization
  exofs: exofs_file_fsync and exofs_file_flush correctness
  exofs: Remove superfluous dependency on buffer_head and writeback

Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
  ceph: generalize mon requests, add pool op support
  ceph: only queue async writeback on cap revocation if there is dirty data
  ceph: do not ignore osd_idle_ttl mount option
  ceph: constify dentry_operations
  ceph: whitespace cleanup
  ceph: add flock/fcntl lock support
  ceph: define on-wire types, constants for file locking support
  ceph: add CEPH_FEATURE_FLOCK to the supported feature bits
  ceph: support v2 reconnect encoding
  ceph: support v2 client_caps encoding
  ceph: move AES iv definition to shared header
  ceph: fix decoding of pool snap info
  ceph: make ->sync_fs not wait if wait==0
  ceph: warn on missing snap realm
  ceph: print useful error message when crush rule not found
  ceph: use %pU to print uuid (fsid)
  ceph: sync header defs with server code
  ceph: clean up header guards
  ceph: strip misleading/obsolete version, feature info
  ceph: specify supported features in super.h
  ...

Merge branch 'msm-video' of git://codeaurora.org/quic/kernel/dwalker/linux-msm

* 'msm-video' of git://codeaurora.org/quic/kernel/dwalker/linux-msm:
video: msm: Fix section mismatch in mddi.c.
drivers: video: msm: drop some unused variables

Merge branch 'ixp4xx' of git://git./linux/kernel/git/chris/linux-2.6

* 'ixp4xx' of git://git.kernel.org/pub/scm/linux/kernel/git/chris/linux-2.6:
  IXP4xx: Fix LL debugging on little-endian CPU.
  IXP4xx: Fix sparse warnings in I/O primitives.
  IXP4xx: Make mdio_bus struct static in the Ethernet driver.
  IXP4xx: Fix ixp4xx_crypto little-endian operation.
  IXP4xx: Prevent HSS transmitter lockup by disabling FRaMe signals.
  ixp4xx/vulcan: add PCI support
  ixp4xx: base support for Arcom Vulcan

Merge branch 'for-linus' of /home/rmk/linux-2.6-arm

* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (226 commits)
  ARM: 6323/1: cam60: don't use __init for cam60_spi_{flash_platform_data,partitions}
  ARM: 6324/1: cam60: move cam60_spi_devices to .init.data
  ARM: 6322/1: imx/pca100: Fix name of spi platform data
  ARM: 6321/1: fix syntax error in main Kconfig file
  ARM: 6297/1: move U300 timer to dynamic clock lookup
  ARM: 6296/1: clock U300 intcon and timer properly
  ARM: 6295/1: fix U300 apb_pclk split
  ARM: 6306/1: fix inverted MMC card detect in U300
  ARM: 6299/1: errata: TLBIASIDIS and TLBIMVAIS operations can broadcast a faulty ASID
  ARM: 6294/1: etm: do a dummy read from OSSRR during initialization
  ARM: 6292/1: coresight: add ETM management registers
  ARM: 6288/1: ftrace: document mcount formats
  ARM: 6287/1: ftrace: clean up mcount assembly indentation
  ARM: 6286/1: fix Thumb-2 decompressor broken by "Auto calculate ZRELADDR"
  ARM: 6281/1: video/imxfb.c: allow usage without BACKLIGHT_CLASS_DEVICE
  ARM: 6280/1: imx: Fix build failure when including <mach/gpio.h> without <linux/spinlock.h>
  ARM: S5PV210: Fix on missing s3c-sdhci card detection method for hsmmc3
  ARM: S5P: Fix on missing S5P_DEV_FIMC in plat-s5p/Kconfig
  ARM: S5PV210: Override FIMC driver name on Aquila board
  ARM: S5PC100: enable FIMC on SMDKC100
  ...

Fix up conflicts in arch/arm/mach-{s5pc100,s5pv210}/cpu.c due to
different subsystem 'setname' calls, and trivial port types in
include/linux/serial_core.h

lib/decompress_bunzip2.c: fix checkstack warning

Fix checkstack error:

lib/decompress_bunzip2.c: In function `get_next_block':
lib/decompress_bunzip2.c:511: warning: the frame size of 1932 bytes is larger than 1024 bytes

byteCount, symToByte, and mtfSymbol cannot be declared static or allocated
dynamically so place them in the bunzip_data struct.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kfifo: add example files to the kernel sample directory

Add four examples to the kernel sample directory.

It shows how to handle:
- a byte stream fifo
- a integer type fifo
- a dynamic record sized fifo
- the fifo DMA functions

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kfifo: replace the old non generic API

Simply replace the whole kfifo.c and kfifo.h files with the new generic
version and fix the kerneldoc API template file.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kfifo: add the new generic kfifo API

Add the new version of the kfifo API files kfifo.c and kfifo.h.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kfifo: fix kfifo miss use of nozami.c

There are different types of a fifo which can not handled in C without a
lot of overhead.  So i decided to write the API as a set of macros, which
is the only way to do a kind of template meta programming without C++.
This macros handles the different types of fifos in a transparent way.

There are a lot of benefits:

- Compile time handling of the different fifo types
- Better performance (a save put or get of an integer does only generate
  9 assembly instructions on a x86)
- Type save
- Cleaner interface, the additional kfifo_..._rec() functions are gone
- Easier to use
- Less error prone
- Different types of fifos: it is now possible to define a int fifo or
  any other type. See below for an example.
- Smaller footprint for none byte type fifos
- No need of creating a second hidden variable, like in the old DEFINE_KFIFO

The API was not changed.

There are now real in place fifos where the data space is a part of the
structure.  The fifo needs now 20 byte plus the fifo space.  Dynamic
assigned or allocated create a little bit more code.

Most of the macros code will be optimized away and simple generate a
function call.  Only the really small one generates inline code.

Additionally you can now create fifos for any data type, not only the
"unsigned char" byte streamed fifos.

There is also a new kfifo_put and kfifo_get function, to handle a single
element in a fifo.  This macros generates inline code, which is lit bit
larger but faster.

I know that this kind of macros are very sophisticated and not easy to
maintain.  But i have all tested and it works as expected.  I analyzed the
output of the compiler and for the x86 the code is as good as hand written
assembler code.  For the byte stream fifo the generate code is exact the
same as with the current kfifo implementation.  For all other types of
fifos the code is smaller before, because the interface is easier to use.

The main goal was to provide an API which is very intuitive, save and easy
to use.  So linux will get now a powerful fifo API which provides all what
a developer needs.  This will save in the future a lot of kernel space,
since there is no need to write an own implementation.  Most of the device
driver developers need a fifo, and also deep kernel development will gain
benefit from this API.

Here are the results of the text section usage:

Example 1:
                        kfifo_put/_get  kfifo_in/out    current kfifo
dynamic allocated       0x000002a8      0x00000291      0x00000299
in place                0x00000291      0x0000026e      0x00000273

kfifo.c                 new             old
text section size       0x00000be5      0x000008b2

As you can see, kfifo_put/kfifo_get creates a little bit more code than
kfifo_in/kfifo_out, but it is much faster (the code is inline).

The code is complete hand crafted and optimized.  The text section size is
as small as possible.  You get all the fifo handling in only 3 kb.  This
includes type safe fix size records, dynamic records and DMA handling.

This should be the final version. All requested features are implemented.

Note: Most features of this API doesn't have any users.  All functions
which are not used in the next 9 months will be removed.  So, please adapt
your drivers and other sources as soon as possible to the new API and post
it.

This are the features which are currently not used in the kernel:

kfifo_to_user()
kfifo_from_user()
kfifo_dma_....() macros
kfifo_esize()
kfifo_recsize()
kfifo_put()
kfifo_get()

The fixed size record elements, exclude "unsigned char" fifo's and the
variable size records fifo's

This patch:

User of the kernel fifo should never bypass the API and directly access
the fifo structure.  Otherwise it will be very hard to maintain the API.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kfifo: kfifo_is_{full,empty} should return bools, not ints

For consistency with other kfifo routines, return bool, not int.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/sysv/super.c: add support for non-PDP11 v7 filesystems

This adds byte order autodetection (of PDP-11 and LE filesystems). No
attempt is made to detect big-endian filesystems -- were there any?
Tested with PDP-11 v7 filesystems and PC-IX maintenance floppy.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/sysv: v7: adjust sanity checks for some volumes

Newly mkfs-ed filesystems from Seventh Edition have last modification time
set to zero, but are otherwise perfectly valid.

Also, tighten up other sanity checks to filter out most filesystems with

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/sysv: add v7 alias

So that the module gets autoloaded when a v7 filesystem is mounted.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kexec: return -EFAULT on copy_to_user() failures

copy_to/from_user() returns the number of bytes remaining to be copied.
It never returns a negative value. The correct return code is -EFAULT and
not -EIO.

All the callers check for non-zero returns so that's Ok, but the return
code is passed to the user so we should fix this.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Kagstrom <simon.kagstrom@netinsight.net>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

parport_serial: use the PCI IRQ if offered

Commit 51dcdfe ("parport: Use the PCI IRQ if offered") added IRQ support
for PCI parallel port devices handled by parport_pc, but turned it off for
parport_serial, despite a printk() message to the contrary.

Signed-off-by: Fr?d?ric Bri?re <fbriere@fbriere.net>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

lib/bug.c: add oops end marker to WARN implementation

We are missing the oops end marker for the exception based WARN implementation
in lib/bug.c. This is useful for logfile analysis tools.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

lib/bug.c: make WARN implementation match the kernel/panic.c one

There are a few issues with the exception based WARN implementation in
lib/bug.c:

- Inconsistent printk flags. The "cut here" line is printed at KERN_EMERG, so
  the console and all logged in users see the single line:

------------[ cut here ]------------

  for each WARN. Fix this so we print everything at KERN_WARNING to match the
  kernel/panic.c version.

- The lib/bug.c WARN would print "Badness at". Change it to match the
  kernel/panic.c version which prints "WARNING: at".

- Print the list of modules, similar to kernel/panic.c of modules, similar to
  kernel/panic.c

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

panic: keep blinking in spite of long spin timer mode

To keep panic_timeout accuracy when running under a hypervisor, the
current implementation only spins on long time (1 second) calls to mdelay.
That brings a good effect, but the problem is the keyboard LEDs don't
blink at all on that situation.

This patch changes to call to panic_blink_enter() between every mdelay and
keeps blinking in spite of long spin timer mode.

The time to call to mdelay is now 100ms. Even this change will keep
panic_timeout accuracy enough when running under a hypervisor.

Signed-off-by: TAMUKI Shoichi <tamuki@linet.gr.jp>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

afs: destroy work queue on init failure

We can clean up the work queue on this error path. This function is
called from afs_init().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dma-mapping: add DMA_xxBIT_MASK to feature-removal-schedule.txt

DMA_xxBIT_MASK macros were marked as deprecated in June 2009. One more
year is long enough, I think.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pci: add PCI DMA unamp state API to feature-removal-schedule.txt

It was replaced with the DMA unamp state API (which can be used for
any bus).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Documentation: DMA-API-HOWTO.txt: add multiple types of IOMMUs support

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dma-mapping: remove dma_is_consistent API

Architectures implement dma_is_consistent() in different ways (some
misinterpret the definition of API in DMA-API.txt).  So it hasn't been so
useful for drivers.  We have only one user of the API in tree.  Unlikely
out-of-tree drivers use the API.

Even if we fix dma_is_consistent() in some architectures, it doesn't look
useful at all.  It was invented long ago for some old systems that can't
allocate coherent memory at all.  It's better to export only APIs that are
definitely necessary for drivers.

Let's remove this API.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

scsi: 53c700: remove dma_is_consistent usage

This driver is the only user of dma_is_consistent(). We plan to remove this
API.

The driver uses the API in the following way:

BUG_ON(!dma_is_consistent(hostdata->dev, pScript) && L1_CACHE_BYTES < dma_get_cache_alignment());

The above code tries to see if L1_CACHE_BYTES is greater than
dma_get_cache_alignment() on sysmtes that can not allocate coherent memory
(some old systems can't).

James Bottomley exmplained that this is necesary because the driver packs the
set of mailboxes into a single coherent area and separates the different
usages by a L1 cache stride. So it's fatal if the dma

He also pointed out that we can kill this checking because we don't hit this
BUG_ON on all architectures that actually use the driver.

(akpm: stolen from the scsi tree because
dma-mapping-remove-dma_is_consistent-api.patch needs it)

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dma-mapping: parisc: set ARCH_DMA_MINALIGN

Architectures that handle DMA-non-coherent memory need to set
ARCH_DMA_MINALIGN to make sure that kmalloc'ed buffer is DMA-safe: the
buffer doesn't share a cache with the others.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dma-mapping: unify dma_get_cache_alignment implementations

dma_get_cache_alignment returns the minimum DMA alignment.  Architectures
defines it as ARCH_DMA_MINALIGN (formally ARCH_KMALLOC_MINALIGN).  So we
can unify dma_get_cache_alignment implementations.

Note that some architectures implement dma_get_cache_alignment wrongly.
dma_get_cache_alignment() should return the minimum DMA alignment.  So
fully-coherent architectures should return 1.  This patch also fixes this
issue.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dma-mapping: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN

Now each architecture has the own dma_get_cache_alignment implementation.

dma_get_cache_alignment returns the minimum DMA alignment.  Architectures
define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that malloc'ed
buffer is DMA-safe; the buffer doesn't share a cache with the others).  So
we can unify dma_get_cache_alignment implementations.

This patch:

dma_get_cache_alignment() needs to know if an architecture defines
ARCH_KMALLOC_MINALIGN or not (needs to know if architecture has DMA
alignment restriction).  However, slab.h define ARCH_KMALLOC_MINALIGN if
architectures doesn't define it.

Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
(except for crypto).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

edac: mpc85xx: add support for new MPCxxx/Pxxxx EDAC controllers

Simply add proper IDs into the device table.

Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Dave Jiang <djiang@mvista.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

edac: i5400: improve handling of pci_enable_device() return value

-EIO is not the only error code that pci_enable_device() may return, also
the set of errors can be enhanced in future. We should compare return
code with zero, not with concrete error value.

Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Jeff Roberson <jroberson@jroberson.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

edac: i5000: improve handling of pci_enable_device() return value

-EIO is not the only error code that pci_enable_device() may return, also
the set of errors can be enhanced in future. We should compare return
code with zero, not with concrete error value.

Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Jeff Roberson <jroberson@jroberson.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

edac: add wissing pieces from MPC85xx -> FSL_SOC_BOOKE

In 5753c082f66eca5be81f6bda85c1718c5eea6ada ("powerpc/85xx: Kconfig
cleanup") menuconfig MPC85xx was replaced by FSL_SOC_BOOKE but some
references insider the code were not adjusted accordingly. This patch
adresses these missing pieces.

Signed-off-by: Christoph Egger <siccegge@cs.fau.de>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Scott Wood <scottwood@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pids: alloc_pidmap: remove the unnecessary boundary checks

alloc_pidmap() calculates max_scan so that if the initial offset != 0 we
inspect the first map->page twice.  This is correct, we want to find the
unused bits < offset in this bitmap block.  Add the comment.

But it doesn't make any sense to stop the find_next_offset() loop when we
are looking into this map->page for the second time.  We have already
already checked the bits >= offset during the first attempt, it is fine to
do this again, no matter if we succeed this time or not.

Remove this hard-to-understand code.  It optimizes the very unlikely case
when we are going to fail, but slows down the more likely case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pids: fix a race in pid generation that causes pids to be reused immediately

A program that repeatedly forks and waits is susceptible to having the
same pid repeated, especially when it competes with another instance of
the same program.  This is really bad for bash implementation.
Furthermore, many shell scripts assume that pid numbers will not be used
for some length of time.

Race Description:

A                                    B

// pid == offset == n                // pid == offset == n + 1
test_and_set_bit(offset, map->page)
                                     test_and_set_bit(offset, map->page);
                                     pid_ns->last_pid = pid;
pid_ns->last_pid = pid;
                                     // pid == n + 1 is freed (wait())

                                     // Next fork()...
                                     last = pid_ns->last_pid; // == n
                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

// The distance mod 32768 between two pids, where the first pid is expected
// to be smaller than the second.
int PidDistance(pid_t first, pid_t second) {
  return (second + 32768 - first) % 32768;
}

int main(int argc, char* argv[]) {
  int failed = 0;
  pid_t last_pid = 0;
  int i;
  printf("%d\n", sizeof(pid_t));
  for (i = 0; i < 10000000; ++i) {
    if (i % 32786 == 0)
      printf("Iter: %d\n", i/32768);
    int child_exit_code = i % 256;
    pid_t pid = fork();
    if (pid == -1) {
      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
      exit(1);
    }
    if (pid == 0) {
      // Child
      exit(child_exit_code);
    } else {
      // Parent
      if (i > 0) {
        int distance = PidDistance(last_pid, pid);
        if (distance == 0 || distance > 30000) {
          fprintf(stderr,
                  "Unexpected pid sequence: previous fork: pid=%d, "
                  "current fork: pid=%d for iteration=%d.\n",
                  last_pid, pid, i);
          failed = 1;
        }
      }
      last_pid = pid;
      int status;
      int reaped = wait(&status);
      if (reaped != pid) {
        fprintf(stderr,
                "Wait return value: expected pid=%d, "
                "got %d, iteration %d\n",
                pid, reaped, i);
        failed = 1;
      } else if (WEXITSTATUS(status) != child_exit_code) {
        fprintf(stderr,
                "Unexpected exit status %x, iteration %d\n",
                WEXITSTATUS(status), i);
        failed = 1;
      }
    }
  }
  exit(failed);
}

Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

partitions: fix sometimes unreadable partition strings

Fix this garbage happening quite often:

==> sda:
scsi 3:0:0:0: CD-ROM TOSHIBA
==> sda1 sda2 sda3 sda4 <sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
^^^
Uniform CD-ROM driver Revision: 3.20
sr 3:0:0:0: Attached scsi CD-ROM sr0
==> sda5 sda6 sda7 >

Make "sda: sda1 ..." lines actually lines.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cs5535-mfgpt: reuse timers that have never been set up

The MFGPT hardware may be set up only once, therefore
cs5535_mfgpt_free_timer() didn't re-set the timer's "avail" bit. However
if a timer is freed before it has actually been in use then it may be made
available again.

Signed-off-by: Jens Rottmann <JRottmann@LiPPERTEmbedded.de>
Acked-by: Andres Salomon <dilinger@queued.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jordan Crouse <jordan@cosmicpenguin.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/char/n_gsm.c: add missing spin_unlock_irqrestore

Add a spin_unlock_irqrestore missing on the error path.  Converting the
return to break leads to the spin_unlock_irqrestore at the end of the
function.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1;
@@

* spin_lock_irqsave(E1,...);
  <+... when != E1
  if (...) {
    ... when != E1
*   return ...;
  }
  ...+>
* spin_unlock_irqrestore(E1,...);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipmi: print info for spmi and smbios paths like acpi and pci

Print out the reg spacing and size for spmi and smbios so BIOS developers
can make them consistent.

Also remove extra PFX on the duplicating path.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Corey Minyard <minyard@acm.org>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipmi: fix memleaking for add_smi when duplicating happen

Free the temporary info struct when we have duplicated ones.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Corey Minyard <minyard@acm.org>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/char/ipmi/ipmi_si_intf.c: fix warning: variable 'addr_space' set but not used

Fix a warning message generated by GCC, and also updates a web address
pointing to a pdf containing information.

CC [M] drivers/char/ipmi/ipmi_si_intf.o
drivers/char/ipmi/ipmi_si_intf.c: In function 'try_init_spmi':
drivers/char/ipmi/ipmi_si_intf.c:2016:8: warning: variable 'addr_space' set but not used

Signed-off-by: Sergey V. <sftp.mtuci@gmail.com>
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

procfs: simplify conditional processing of fs/proc.o.

Since the entire fs/proc directory is conditionally included based on
CONFIG_PROC_FS, it's redundant to check that same variable within that
directory.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

signalfd: fill in ssi_int for posix timers and message queues

If signalfd is used to consume a signal generated by a POSIX interval
timer or POSIX message queue, the ssi_int field does not reflect the data
(sigevent->sigev_value) supplied to timer_create(2) or mq_notify(3).  (The
ssi_ptr field, however, is filled in.)

This behavior differs from signalfd's treatment of sigqueue-generated
signals -- see the default case in signalfd_copyinfo.  It also gives
results that differ from the case when a signal is handled conventionally
via a sigaction-registered handler.

So, set signalfd_siginfo->ssi_int in the remaining cases (__SI_TIMER,
__SI_MESGQ) where ssi_ptr is set.

akpm: a non-back-compatible change.  Merge into -stable to minimise the
number of kernels which are in the field and which miss this feature.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ptrace: optimize exit_ptrace() for the likely case

exit_ptrace() takes tasklist_lock unconditionally.  We need this lock to
avoid the race with ptrace_traceme(), it acts as a barrier.

Change its caller, forget_original_parent(), to call exit_ptrace() under
tasklist_lock.  Change exit_ptrace() to drop and reacquire this lock if
needed.

This allows us to add the fastpath list_empty(ptraced) check.  In the
likely no-tracees case exit_ptrace() just returns and we avoid the lock()
+ unlock() sequence.

"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> suggested to add this
check, and he reports that this change adds about 11% improvement in some
tests.

Suggested-and-tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: convert to use zone_to_nid() from bare zone->zone_pgdat->node_id

We have zone_to_nid(). this patch convert all existing users of
zone->zone_pgdat->node_id.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: remove nid and zid argument from mem_cgroup_soft_limit_reclaim()

mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument. but nid
and zid can be calculated from zone. So remove it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: mem_cgroup_shrink_node_zone() doesn't need sc.nodemask

Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly. thus
it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
use it at all.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: kill unnecessary initialization in mem_cgroup_shrink_node_zone()

sc.nr_reclaimed and sc.nr_scanned have already been initialized few lines
above "struct scan_control sc = {}" statement.

So, This patch remove this unnecessary code.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: sc.nr_to_reclaim should be initialized

Currently, mem_cgroup_shrink_node_zone() initialize sc.nr_to_reclaim as 0.
It mean shrink_zone() only scan 32 pages and immediately return even if
it doesn't reclaim any pages.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: avoid css_get()

Now, memory cgroup increments css(cgroup subsys state)'s reference count
per a charged page.  And the reference count is kept until the page is
uncharged.  But this has 2 bad effect.

1. Because css_get/put calls atomic_inc()/dec, heavy call of them
    on large smp will not scale well.
2. Because css's refcnt cannot be in a state as "ready-to-release",
    cgroup's notify_on_release handler can't work with memcg.
3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small.

This has been a problem since the 1st merge of memcg.

This is a trial to remove css's refcnt per a page. Even if we remove
refcnt, pre_destroy() does enough synchronization as
  - check res->usage == 0.
  - check no pages on LRU.

This patch removes css's refcnt per page.  Even after this patch, at the
1st look, it seems css_get() is still called in try_charge().

But the logic is.

  - If a memcg of mm->owner is cached one, consume_stock() will work.
    At success, return immediately.
  - If consume_stock returns false, css_get() is called and go to
    slow path which may be blocked. At the end of slow path,
    css_put() is called and restart from the start if necessary.

So, in the fast path, we don't call css_get() and can avoid access to
shared counter. This patch can make the most possible case fast.

Here is a result of multi-threaded page fault benchmark.

[Before]
    25.32%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
     9.30%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
     8.02%  multi-fault-all  [kernel.kallsyms]      [k] try_get_mem_cgroup_from_mm <=====(*)
     7.83%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
     5.38%  multi-fault-all  [kernel.kallsyms]      [k] __css_put
     5.29%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
     4.92%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
     4.24%  multi-fault-all  [kernel.kallsyms]      [k] up_read
     3.53%  multi-fault-all  [kernel.kallsyms]      [k] css_put
     2.11%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
     1.76%  multi-fault-all  [kernel.kallsyms]      [k] __rmqueue
     1.64%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge

[After]
    28.41%  multi-fault-all  [kernel.kallsyms]      [k] clear_page_c
    10.08%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irq
     9.58%  multi-fault-all  [kernel.kallsyms]      [k] down_read_trylock
     9.38%  multi-fault-all  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
     5.86%  multi-fault-all  [kernel.kallsyms]      [k] __alloc_pages_nodemask
     5.65%  multi-fault-all  [kernel.kallsyms]      [k] up_read
     2.82%  multi-fault-all  [kernel.kallsyms]      [k] handle_mm_fault
     2.64%  multi-fault-all  [kernel.kallsyms]      [k] mem_cgroup_add_lru_list
     2.48%  multi-fault-all  [kernel.kallsyms]      [k] __mem_cgroup_commit_charge

Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch
removes css_tryget() in it. (But yes, this is an extreme case.)

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: use find_lock_task_mm() in memory cgroups oom

When the OOM killer scans task, it check a task is under memcg or
not when it's called via memcg's context.

But, as Oleg pointed out, a thread group leader may have NULL ->mm
and task_in_mem_cgroup() may do wrong decision. We have to use
find_lock_task_mm() in memcg as generic OOM-Killer does.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: remove mem from arg of charge_common

mem_cgroup_charge_common() is always called with @mem = NULL, so it's
meaningless. This patch removes it.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: remove redundant code

- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
don't have to call them in task_in_mem_cgroup().
- *mz is not used in __mem_cgroup_uncharge_common().
- we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
after we've cleared PCG_MIGRATION of @oldpage.
- remove empty comment.
- remove redundant empty line in mem_cgroup_cache_charge().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memcg: clean up waiting move acct

Now, for checking a memcg is under task-account-moving, we do css_tryget()
against mc.to and mc.from. But this is just complicating things. This
patch makes the check easier.

This patch adds a spinlock to move_charge_struct and guard modification of
mc.to and mc.from. By this, we don't have to think about complicated
races arount this not-critical path.

[balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>