platform/adaptation/renesas_rcar/renesas_kernel.git
14 years agooom: remove special handling for pagefault ooms
David Rientjes [Tue, 10 Aug 2010 00:18:55 +0000 (17:18 -0700)]
oom: remove special handling for pagefault ooms

It is possible to remove the special pagefault oom handler by simply oom
locking all system zones and then calling directly into out_of_memory().

All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
parallel oom killing in progress that will lead to eventual memory freeing
so it's not necessary to needlessly kill another task.  The context in
which the pagefault is allocating memory is unknown to the oom killer, so
this is done on a system-wide level.

If a task has already been oom killed and hasn't fully exited yet, this
will be a no-op since select_bad_process() recognizes tasks across the
system with TIF_MEMDIE set.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: extract panic helper function
David Rientjes [Tue, 10 Aug 2010 00:18:54 +0000 (17:18 -0700)]
oom: extract panic helper function

There are various points in the oom killer where the kernel must determine
whether to panic or not.  It's better to extract this to a helper function
to remove all the confusion as to its semantics.

Also fix a call to dump_header() where tasklist_lock is not read- locked,
as required.

There's no functional change with this patch.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: avoid oom killer for lowmem allocations
David Rientjes [Tue, 10 Aug 2010 00:18:54 +0000 (17:18 -0700)]
oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is no
guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: enable oom tasklist dump by default
David Rientjes [Tue, 10 Aug 2010 00:18:53 +0000 (17:18 -0700)]
oom: enable oom tasklist dump by default

The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
very helpful information in diagnosing why a user's task has been killed.
It emits useful information such as each eligible thread's memory usage
that can determine why the system is oom, so it should be enabled by
default.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: select task from tasklist for mempolicy ooms
David Rientjes [Tue, 10 Aug 2010 00:18:52 +0000 (17:18 -0700)]
oom: select task from tasklist for mempolicy ooms

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: sacrifice child with highest badness score for parent
David Rientjes [Tue, 10 Aug 2010 00:18:51 +0000 (17:18 -0700)]
oom: sacrifice child with highest badness score for parent

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.  Unfortunately,
this often kills in a seemingly random fashion based on the ordering of
the selected task's child list.  Additionally, it is not guaranteed at all
to free a large amount of memory that we need to prevent additional oom
killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists.  The worst child is indicated with the
highest badness() score.  This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful and
then the parent if none were found while the new code simply kills the
most memory-hogging task or the parent.  Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or has
a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
so the final oom_kill_task() will always succeed.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: filter tasks not sharing the same cpuset
David Rientjes [Tue, 10 Aug 2010 00:18:50 +0000 (17:18 -0700)]
oom: filter tasks not sharing the same cpuset

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: avoid sending exiting tasks a SIGKILL
David Rientjes [Tue, 10 Aug 2010 00:18:49 +0000 (17:18 -0700)]
oom: avoid sending exiting tasks a SIGKILL

It's unnecessary to SIGKILL a task that is already PF_EXITING and can
actually cause a NULL pointer dereference of the sighand if it has already
been detached.  Instead, simply set TIF_MEMDIE so it has access to memory
reserves and can quickly exit as the comment implies.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: give current access to memory reserves if it has been killed
David Rientjes [Tue, 10 Aug 2010 00:18:48 +0000 (17:18 -0700)]
oom: give current access to memory reserves if it has been killed

It's possible to livelock the page allocator if a thread has mm->mmap_sem
and fails to make forward progress because the oom killer selects another
thread sharing the same ->mm to kill that cannot exit until the semaphore
is dropped.

The oom killer will not kill multiple tasks at the same time; each oom
killed task must exit before another task may be killed.  Thus, if one
thread is holding mm->mmap_sem and cannot allocate memory, all threads
sharing the same ->mm are blocked from exiting as well.  In the oom kill
case, that means the thread holding mm->mmap_sem will never free
additional memory since it cannot get access to memory reserves and the
thread that depends on it with access to memory reserves cannot exit
because it cannot acquire the semaphore.  Thus, the page allocators
livelocks.

When the oom killer is called and current happens to have a pending
SIGKILL, this patch automatically gives it access to memory reserves and
returns.  Upon returning to the page allocator, its allocation will
hopefully succeed so it can quickly exit and free its memory.  If not, the
page allocator will fail the allocation if it is not __GFP_NOFAIL.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: dump_tasks use find_lock_task_mm too fix
David Rientjes [Tue, 10 Aug 2010 00:18:47 +0000 (17:18 -0700)]
oom: dump_tasks use find_lock_task_mm too fix

When find_lock_task_mm() returns a thread other than p in dump_tasks(),
its name should be displayed instead.  This is the thread that will be
targeted by the oom killer, not its mm-less parent.

This also allows us to safely dereference task->comm without needing
get_task_comm().

While we're here, remove the cast on task_cpu(task) as Andrew suggested.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: improve commentary in dump_tasks()
David Rientjes [Tue, 10 Aug 2010 00:18:46 +0000 (17:18 -0700)]
oom: improve commentary in dump_tasks()

The comments in dump_tasks() should be updated to be more clear about why
tasks are filtered and how they are filtered by its argument.

An unnecessary comment concerning a check for is_global_init() is removed
since it isn't of importance.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: dump_tasks use find_lock_task_mm too
KOSAKI Motohiro [Tue, 10 Aug 2010 00:18:46 +0000 (17:18 -0700)]
oom: dump_tasks use find_lock_task_mm too

dump_task() should use find_lock_task_mm() too. It is necessary for
protecting task-exiting race.

dump_tasks() currently filters any task that does not have an attached
->mm since it incorrectly assumes that it must either be in the process of
exiting and has detached its memory or that it's a kernel thread;
multithreaded tasks may actually have subthreads that have a valid ->mm
pointer and thus those threads should actually be displayed.  This change
finds those threads, if they exist, and emit their information along with
the rest of the candidate tasks for kill.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: introduce find_lock_task_mm() to fix !mm false positives
Oleg Nesterov [Tue, 10 Aug 2010 00:18:45 +0000 (17:18 -0700)]
oom: introduce find_lock_task_mm() to fix !mm false positives

Almost all ->mm == NULL checks in oom_kill.c are wrong.

The current code assumes that the task without ->mm has already released
its memory and ignores the process.  However this is not necessarily true
when this process is multithreaded, other live sub-threads can use this
->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
  just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
  thread which uses the memory and takes task_lock() to pin ->mm

- change oom_badness() to use this helper instead of just checking
  ->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
  task without ->mm, but no matter what oom_badness() returns the
  task can be chosen if nothing else has been found yet.

  Change oom_badness() to return int, change it to return -1 if
  find_lock_task_mm() fails, and change select_bad_process() to
  check points >= 0.

Note! This patch is not enough, we need more changes.

- oom_badness() was fixed, but oom_kill_task() still ignores
  the task without ->mm

- oom_forkbomb_penalty() should use find_lock_task_mm() too,
  and it also needs other changes to actually find the first
  first-descendant children

This will be addressed later.

[kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: PF_EXITING check should take mm into account
Oleg Nesterov [Tue, 10 Aug 2010 00:18:44 +0000 (17:18 -0700)]
oom: PF_EXITING check should take mm into account

select_bad_process() checks PF_EXITING to detect the task which is going
to release its memory, but the logic is very wrong.

- a single process P with the dead group leader disables
  select_bad_process() completely, it will always return
  ERR_PTR() while P can live forever

- if the PF_EXITING task has already released its ->mm
  it doesn't make sense to expect it is goiing to free
  more memory (except task_struct/etc)

Change the code to ignore the PF_EXITING tasks without ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agooom: check PF_KTHREAD instead of !mm to skip kthreads
Oleg Nesterov [Tue, 10 Aug 2010 00:18:43 +0000 (17:18 -0700)]
oom: check PF_KTHREAD instead of !mm to skip kthreads

select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
is not true due to use_mm().

Change the code to check PF_KTHREAD.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agobuffer_head: remove redundant test from wait_on_buffer
Richard Kennedy [Tue, 10 Aug 2010 00:18:42 +0000 (17:18 -0700)]
buffer_head: remove redundant test from wait_on_buffer

The comment suggests that when b_count equals zero it is calling
__wait_no_buffer to trigger some debug, but as there is no debug in
__wait_on_buffer the whole thing is redundant.

AFAICT from the git log this has been the case for at least 5 years, so
it seems safe just to remove this.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: extend KSM refcounts to the anon_vma root
Rik van Riel [Tue, 10 Aug 2010 00:18:41 +0000 (17:18 -0700)]
mm: extend KSM refcounts to the anon_vma root

KSM reference counts can cause an anon_vma to exist after the processe it
belongs to have already exited.  Because the anon_vma lock now lives in
the root anon_vma, we need to ensure that the root anon_vma stays around
until after all the "child" anon_vmas have been freed.

The obvious way to do this is to have a "child" anon_vma take a reference
to the root in anon_vma_fork.  When the anon_vma is freed at munmap or
process exit, we drop the refcount in anon_vma_unlink and possibly free
the root anon_vma.

The KSM anon_vma reference count function also needs to be modified to
deal with the possibility of freeing 2 levels of anon_vma.  The easiest
way to do this is to break out the KSM magic and make it generic.

When compiling without CONFIG_KSM, this code is compiled out.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: always lock the root (oldest) anon_vma
Rik van Riel [Tue, 10 Aug 2010 00:18:40 +0000 (17:18 -0700)]
mm: always lock the root (oldest) anon_vma

Always (and only) lock the root (oldest) anon_vma whenever we do something
in an anon_vma.  The recently introduced anon_vma scalability is due to
the rmap code scanning only the VMAs that need to be scanned.  Many common
operations still took the anon_vma lock on the root anon_vma, so always
taking that lock is not expected to introduce any scalability issues.

However, always taking the same lock does mean we only need to take one
lock, which means rmap_walk on pages from any anon_vma in the vma is
excluded from occurring during an munmap, expand_stack or other operation
that needs to exclude rmap_walk and similar functions.

Also add the proper locking to vma_adjust.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: track the root (oldest) anon_vma
Rik van Riel [Tue, 10 Aug 2010 00:18:39 +0000 (17:18 -0700)]
mm: track the root (oldest) anon_vma

Track the root (oldest) anon_vma in each anon_vma tree.  Because we only
take the lock on the root anon_vma, we cannot use the lock on higher-up
anon_vmas to lock anything.  This makes it impossible to do an indirect
lookup of the root anon_vma, since the data structures could go away from
under us.

However, a direct pointer is safe because the root anon_vma is always the
last one that gets freed on munmap or exit, by virtue of the same_vma list
order and unlink_anon_vmas walking the list forward.

[akpm@linux-foundation.org: fix typo]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: change direct call of spin_lock(anon_vma->lock) to inline function
Rik van Riel [Tue, 10 Aug 2010 00:18:38 +0000 (17:18 -0700)]
mm: change direct call of spin_lock(anon_vma->lock) to inline function

Subsitute a direct call of spin_lock(anon_vma->lock) with an inline
function doing exactly the same.

This makes it easier to do the substitution to the root anon_vma lock in a
following patch.

We will deal with the handful of special locks (nested, dec_and_lock, etc)
separately.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: rename anon_vma_lock to vma_lock_anon_vma
Rik van Riel [Tue, 10 Aug 2010 00:18:37 +0000 (17:18 -0700)]
mm: rename anon_vma_lock to vma_lock_anon_vma

Rename anon_vma_lock to vma_lock_anon_vma.  This matches the naming style
used in page_lock_anon_vma and will come in really handy further down in
this patch series.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agokmap_atomic: make kunmap_atomic() harder to misuse
Cesar Eduardo Barros [Tue, 10 Aug 2010 00:18:32 +0000 (17:18 -0700)]
kmap_atomic: make kunmap_atomic() harder to misuse

kunmap_atomic() is currently at level -4 on Rusty's "Hard To Misuse"
list[1] ("Follow common convention and you'll get it wrong"), except in
some architectures when CONFIG_DEBUG_HIGHMEM is set[2][3].

kunmap() takes a pointer to a struct page; kunmap_atomic(), however, takes
takes a pointer to within the page itself.  This seems to once in a while
trip people up (the convention they are following is the one from
kunmap()).

Make it much harder to misuse, by moving it to level 9 on Rusty's list[4]
("The compiler/linker won't let you get it wrong").  This is done by
refusing to build if the type of its first argument is a pointer to a
struct page.

The real kunmap_atomic() is renamed to kunmap_atomic_notypecheck()
(which is what you would call in case for some strange reason calling it
with a pointer to a struct page is not incorrect in your code).

The previous version of this patch was compile tested on x86-64.

[1] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
[2] In these cases, it is at level 5, "Do it right or it will always
    break at runtime."
[3] At least mips and powerpc look very similar, and sparc also seems to
    share a common ancestor with both; there seems to be quite some
    degree of copy-and-paste coding here. The include/asm/highmem.h file
    for these three archs mention x86 CPUs at its top.
[4] http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
[5] As an aside, could someone tell me why mn10300 uses unsigned long as
    the first parameter of kunmap_atomic() instead of void *?

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Cc: Russell King <linux@arm.linux.org.uk> (arch/arm)
Cc: Ralf Baechle <ralf@linux-mips.org> (arch/mips)
Cc: David Howells <dhowells@redhat.com> (arch/frv, arch/mn10300)
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> (arch/mn10300)
Cc: Kyle McMartin <kyle@mcmartin.ca> (arch/parisc)
Cc: Helge Deller <deller@gmx.de> (arch/parisc)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> (arch/parisc)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> (arch/powerpc)
Cc: Paul Mackerras <paulus@samba.org> (arch/powerpc)
Cc: "David S. Miller" <davem@davemloft.net> (arch/sparc)
Cc: Thomas Gleixner <tglx@linutronix.de> (arch/x86)
Cc: Ingo Molnar <mingo@redhat.com> (arch/x86)
Cc: "H. Peter Anvin" <hpa@zytor.com> (arch/x86)
Cc: Arnd Bergmann <arnd@arndb.de> (include/asm-generic)
Cc: Rusty Russell <rusty@rustcorp.com.au> ("Hard To Misuse" list)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agohugetlb: call mmu notifiers on hugepage cow
Doug Doan [Tue, 10 Aug 2010 00:18:30 +0000 (17:18 -0700)]
hugetlb: call mmu notifiers on hugepage cow

When a copy-on-write occurs, we take one of two paths in handle_mm_fault:
through handle_pte_fault for normal pages, or through hugetlb_fault for
huge pages.

In the normal page case, we eventually get to do_wp_page and call mmu
notifiers via ptep_clear_flush_notify.  There is no callout to the mmmu
notifiers in the huge page case.  This patch fixes that.

Signed-off-by: Doug Doan <dougd@cray.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: provide init_mm mm_context initializer
Heiko Carstens [Tue, 10 Aug 2010 00:18:28 +0000 (17:18 -0700)]
mm: provide init_mm mm_context initializer

Provide an INIT_MM_CONTEXT intializer macro which can be used to
statically initialize mm_struct:mm_context of init_mm.  This way we can
get rid of code which will do the initialization at run time (on s390).

In addition the current code can be found at a place where it is not
expected.  So let's have a common initializer which architectures
can use if needed.

This is based on a patch from Suzuki Poulose.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: use ERR_CAST
Julia Lawall [Tue, 10 Aug 2010 00:18:28 +0000 (17:18 -0700)]
mm: use ERR_CAST

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agomm: use memdup_user
Julia Lawall [Tue, 10 Aug 2010 00:18:26 +0000 (17:18 -0700)]
mm: use memdup_user

Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
                 || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoasm-generic: use raw_local_irq_save/restore instead local_irq_save/restore
Michal Simek [Tue, 10 Aug 2010 00:18:24 +0000 (17:18 -0700)]
asm-generic: use raw_local_irq_save/restore instead local_irq_save/restore

The start/stop_critical_timing functions for preemptirqsoff, preemptoff
and irqsoff tracers contain atomic_inc() and atomic_dec() operations.

Atomic operations use local_irq_save/restore macros to ensure atomic
access but they are traced by the same function which is causing recursion
problem.

The reason is when these tracers are turn ON then the
local_irq_save/restore macros are changed in include/linux/irqflags.h to
call trace_hardirqs_on/off which call start/stop_critical_timing.

Microblaze was affected because it uses generic atomic implementation.

Signed-off-by: Michal Simek <monstr@monstr.eu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agodrivers/video/w100fb.c: ignore void return value / fix build failure
Peter Huewe [Tue, 10 Aug 2010 00:18:23 +0000 (17:18 -0700)]
drivers/video/w100fb.c: ignore void return value / fix build failure

Fix a build failure "error: void value not ignored as it ought to be"
by removing an assignment of a void return value.  The functionality of
the code is not changed.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Acked-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoipmi: fix ACPI detection with regspacing
Yinghai Lu [Tue, 10 Aug 2010 00:18:22 +0000 (17:18 -0700)]
ipmi: fix ACPI detection with regspacing

After the commit that changed ipmi_si detecting sequence from SMBIOS/ACPI
to ACPI/SMBIOS,

| commit 754d453185275951d39792865927ec494fa1ebd8
| Author: Matthew Garrett <mjg@redhat.com>
| Date:   Wed May 26 14:43:47 2010 -0700
|
|    ipmi: change device discovery order
|
|    The ipmi spec provides an ordering for si discovery.  Change the driver to
|    match, with the exception of preferring smbios to SPMI as HPs (at least)
|    contain accurate information in the former but not the latter.

ipmi_si can not be initialized.

[  138.799739] calling  init_ipmi_devintf+0x0/0x109 @ 1
[  138.805050] ipmi device interface
[  138.818131] initcall init_ipmi_devintf+0x0/0x109 returned 0 after 12797 usecs
[  138.822998] calling  init_ipmi_si+0x0/0xa90 @ 1
[  138.840276] IPMI System Interface driver.
[  138.846137] ipmi_si: probing via ACPI
[  138.849225] ipmi_si 00:09: [io  0x0ca2] regsize 1 spacing 1 irq 0
[  138.864438] ipmi_si: Adding ACPI-specified kcs state machine
[  138.870893] ipmi_si: probing via SMBIOS
[  138.880945] ipmi_si: Adding SMBIOS-specified kcs state machineipmi_si: duplicate interface
[  138.896511] ipmi_si: probing via SPMI
[  138.899861] ipmi_si: Adding SPMI-specified kcs state machineipmi_si: duplicate interface
[  138.917095] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[  138.928658] ipmi_si: Interface detection failed
[  138.953411] initcall init_ipmi_si+0x0/0xa90 returned 0 after 110847 usecs

in smbios has
DMI/SMBIOS
Handle 0x00C5, DMI type 38, 18 bytes
IPMI Device Information
        Interface Type: KCS (Keyboard Control Style)
        Specification Version: 2.0
        I2C Slave Address: 0x00
        NV Storage Device: Not Present
        Base Address: 0x0000000000000CA2 (I/O)
        Register Spacing: 32-bit Boundaries
in DSDT has
                    Device (BMC)
                    {

                        Name (_HID, EisaId ("IPI0001"))
                        Method (_STA, 0, NotSerialized)
                        {
                            If (LEqual (OSN, Zero))
                            {
                                Return (Zero)
                            }

                            Return (0x0F)
                        }

                        Name (_STR, Unicode ("IPMI_KCS"))
                        Name (_UID, Zero)
                        Name (_CRS, ResourceTemplate ()
                        {
                            IO (Decode16,
                                0x0CA2,             // Range Minimum
                                0x0CA2,             // Range Maximum
                                0x00,               // Alignment
                                0x01,               // Length
                                )
                            IO (Decode16,
                                0x0CA6,             // Range Minimum
                                0x0CA6,             // Range Maximum
                                0x00,               // Alignment
                                0x01,               // Length
                                )
                        })
                        Method (_IFT, 0, NotSerialized)
                        {
                            Return (One)
                        }

                        Method (_SRV, 0, NotSerialized)
                        {
                            Return (0x0200)
                        }
                    }

so the reg spacing should be 4 instead of 1.

Try to calculate regspacing for this kind of system.

Observed on a Sun Fire X4800.  Other OSes work and pass certification.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Myron Stowe <myron.stowe@hp.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Linus Torvalds [Tue, 10 Aug 2010 02:30:17 +0000 (19:30 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tj/wq

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  drm: fix fallouts from slow-work -> wq conversion
  workqueue: workqueue_cpu_callback() should be cpu_notifier instead of hotcpu_notifier
  workqueue: add missing __percpu markup in kernel/workqueue.c

14 years agono need for list_for_each_entry_safe()/resetting with superblock list
Al Viro [Sat, 24 Jul 2010 22:31:46 +0000 (02:31 +0400)]
no need for list_for_each_entry_safe()/resetting with superblock list

just delay __put_super() a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoFix sget() race with failing mount
Al Viro [Mon, 9 Aug 2010 16:05:43 +0000 (12:05 -0400)]
Fix sget() race with failing mount

If sget() finds a matching superblock being set up, it'll
grab an active reference to it and grab s_umount.  That's
fine - we'll wait for completion of foofs_get_sb() that way.
However, if said foofs_get_sb() fails we'll end up holding
the halfway-created superblock.  deactivate_locked_super()
called by foofs_get_sb() will just unlock the sucker since
we are holding another active reference to it.

What we need is a way to tell if superblock has been successfully
set up.  Unfortunately, neither ->s_root nor the check for
MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
new flag set by the (only) caller of ->get_sb().  If that flag
isn't present by the time sget() grabbed s_umount on preexisting
superblock it has found, it's seeing a stillborn and should
just bury it with deactivate_locked_super() (and repeat the search).

Longer term we want to set that flag in ->get_sb() instances (and
check for it to distinguish between "sget() found us a live sb"
and "sget() has allocated an sb, we need to set it up" in there,
instead of checking ->s_root as we do now).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
14 years agovfs: don't hold s_umount over close_bdev_exclusive() call
Tejun Heo [Tue, 20 Jul 2010 22:18:07 +0000 (15:18 -0700)]
vfs: don't hold s_umount over close_bdev_exclusive() call

Fix an obscure AB-BA deadlock in get_sb_bdev().

When a superblock is mounted more than once get_sb_bdev() calls
close_bdev_exclusive() to drop the extra bdev reference while holding
s_umount.  However, sb->s_umount nests inside bd_mutex during
__invalidate_device() and close_bdev_exclusive() acquires bd_mutex during
blkdev_put(); thus creating an AB-BA deadlock.

This condition doesn't trigger frequently.  For this condition to be
visible to lockdep, the filesystem must occupy the whole device (as
__invalidate_device() only grabs bd_mutex for the whole device), the FS
must be mounted more than once and partition rescan should be issued while
the FS is still mounted.

Fix it by dropping s_umount over close_bdev_exclusive().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ciprian Docan <docan@eden.rutgers.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agosysv: do not mark superblock dirty on remount
Artem Bityutskiy [Mon, 5 Jul 2010 12:15:04 +0000 (15:15 +0300)]
sysv: do not mark superblock dirty on remount

No need to mark the superblock as dirty in sysv_remount, synchronize
it instead (only if mounting R/O).

I did not find any docs about this file-system, and I have no possibility
to test my changes. Thus, this is untested. I see other issues in sysv,
e.g., why sysv_sync_fs writes only in the FSTYPE_SYSV4 case? However,
it marks its SB bh's dirty for all types, and does not wait for them
ever. With zero docs I'm unable to fix this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agosysv: do not mark superblock dirty on mount
Artem Bityutskiy [Mon, 5 Jul 2010 12:15:03 +0000 (15:15 +0300)]
sysv: do not mark superblock dirty on mount

I did not find any docs about this file-system, and I have no possibility
to test my changes. Thus, this is untested.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agobtrfs: remove junk sb_dirt change
Artem Bityutskiy [Mon, 5 Jul 2010 12:15:02 +0000 (15:15 +0300)]
btrfs: remove junk sb_dirt change

BTRFS does not define a '->write_super()' method, so it should
not mark its superblock as dirty. This looks like some left-over.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoBFS: clean up the superblock usage
Artem Bityutskiy [Mon, 5 Jul 2010 12:15:01 +0000 (15:15 +0300)]
BFS: clean up the superblock usage

BFS is a very simple FS and its superblocks contains only static
information and is never changed. However, the BFS code for some
misterious reasons marked its buffer head as dirty from time to
time, but nothing in that buffer was ever changed.

This patch removes all the BFS superblock manipulation, simply
because it is not needed. It removes:

1. The si_sbh filed from 'struct bfs_sb_info' because it is not
   needed. We only need to read the SB once on mount to get the
   start of data blocks and the FS size. After this, we can forget
   about the SB.
2. All instances of 'mark_buffer_dirty(sbh)' for BFS SB because
   it is never changed.
3. The '->sync_fs()' method because there is nothing to sync
   (inodes are synched by VFS).
4. The '->write_super()' method, again, because the SB is never
   changed.

Tested-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoAFFS: wait for sb synchronization when needed
Artem Bityutskiy [Mon, 5 Jul 2010 12:15:00 +0000 (15:15 +0300)]
AFFS: wait for sb synchronization when needed

AFFS does not ever wait for superblock synchronization in
->put_super(), ->write_super, and ->sync_fs().

However, it should wait for synchronization in ->put_super() because
it is about to be unmounted, in ->write_super() because this is
periodic SB synchronization performed from a separate kernel thread,
and in ->sync_fs() it should respect the 'wait' flag. This patch fixes
the situation.

Also, in ->put_super(), do not write the SB if it is not dirty.

Tested-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoAFFS: clean up dirty flag usage
Artem Bityutskiy [Mon, 5 Jul 2010 12:14:59 +0000 (15:14 +0300)]
AFFS: clean up dirty flag usage

In 'affs_write_super()': remove ancient and wrong commented code,
remove unneeded 'clean' variable, so the function becomes a bit
cleaner and simpler.

In 'affs_remount(): remove unnecessary SB dirty flag changes.

Tested-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agocifs: truncate fallout
Christoph Hellwig [Sun, 18 Jul 2010 21:51:21 +0000 (17:51 -0400)]
cifs: truncate fallout

Remove the calls to inode_newsize_ok given that we already did it as
part of inode_change_ok in the beginning of cifs_setattr_(no)unix.

No need to call ->truncate if cifs doesn't have one, so remove the
explicit call in cifs_vmtruncate, and replace the calls to vmtruncate
with truncate_setsize which is vmtruncate minus inode_newsize_ok
and the call to ->truncate.

Rename cifs_vmtruncate to cifs_setsize to match the new calling conventions.

Question 1:  why does cifs do the pagecache munging and i_size update twice
for each setattr call, once opencoded in cifs_vmtruncate, and once
using the VFS helpers?
Question 2: what is supposed to be protected by i_lock in cifs_vmtruncate?
Do we need it around the call to inode_change_ok?

[AV: fixed build breakage]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agombcache: fix shrinker function return value
Andreas Gruenbacher [Wed, 21 Jul 2010 17:44:45 +0000 (19:44 +0200)]
mbcache: fix shrinker function return value

The shrinker function is supposed to return the number of cache
entries after shrinking, not before shrinking.  Fix that.

Based on a patch from Wang Sheng-Hui <crosslonelyover@gmail.com>.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agombcache: Remove unused features
Andreas Gruenbacher [Mon, 19 Jul 2010 16:19:41 +0000 (18:19 +0200)]
mbcache: Remove unused features

The mbcache code was written to support a variable number of indexes,
but all the existing users use exactly one index.  Simplify to code to
support only that case.

There are also no users of the cache entry free operation, and none of
the users keep extra data in cache entries.  Remove those features as
well.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoadd f_flags to struct statfs(64)
Christoph Hellwig [Wed, 7 Jul 2010 16:53:25 +0000 (18:53 +0200)]
add f_flags to struct statfs(64)

Add a flags field to help glibc implementing statvfs(3) efficiently.

We copy the flag values from glibc, and add a new ST_VALID flag to
denote that f_flags is implemented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agopass a struct path to vfs_statfs
Christoph Hellwig [Wed, 7 Jul 2010 16:53:11 +0000 (18:53 +0200)]
pass a struct path to vfs_statfs

We'll need the path to implement the flags field for statvfs support.
We do have it available in all callers except:

 - ecryptfs_statfs.  This one doesn't actually need vfs_statfs but just
   needs to do a caller to the lower filesystem statfs method.
 - sys_ustat.  Add a non-exported statfs_by_dentry helper for it which
   doesn't won't be able to fill out the flags field later on.

In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
of the misleading vfs prefix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoupdate VFS documentation for method changes.
Al Viro [Tue, 8 Jun 2010 04:37:12 +0000 (00:37 -0400)]
update VFS documentation for method changes.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoAll filesystems that need invalidate_inode_buffers() are doing that explicitly
Al Viro [Mon, 7 Jun 2010 18:35:46 +0000 (14:35 -0400)]
All filesystems that need invalidate_inode_buffers() are doing that explicitly

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert remaining ->clear_inode() to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 18:34:48 +0000 (14:34 -0400)]
convert remaining ->clear_inode() to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoMake ->drop_inode() just return whether inode needs to be dropped
Al Viro [Mon, 7 Jun 2010 17:43:19 +0000 (13:43 -0400)]
Make ->drop_inode() just return whether inode needs to be dropped

... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agofs/inode.c:clear_inode() is gone
Al Viro [Mon, 7 Jun 2010 17:23:20 +0000 (13:23 -0400)]
fs/inode.c:clear_inode() is gone

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agofs/inode.c:evict() doesn't care about delete vs. non-delete paths now
Al Viro [Mon, 7 Jun 2010 17:21:05 +0000 (13:21 -0400)]
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years ago->delete_inode() is gone
Al Viro [Mon, 7 Jun 2010 17:20:09 +0000 (13:20 -0400)]
->delete_inode() is gone

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert ext4 to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 17:16:22 +0000 (13:16 -0400)]
convert ext4 to ->evict_inode()

pretty much brute-force...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert logfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 17:11:34 +0000 (13:11 -0400)]
convert logfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agologfs: get rid of magical inodes
Al Viro [Mon, 7 Jun 2010 16:22:31 +0000 (12:22 -0400)]
logfs: get rid of magical inodes

ordering problems at ->kill_sb() time are solved by doing iput()
of these suckers in ->put_super()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert nilfs2 to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 15:55:00 +0000 (11:55 -0400)]
convert nilfs2 to ->evict_inode()

[folded build fix from sfr]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert exofs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 15:42:26 +0000 (11:42 -0400)]
convert exofs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert reiserfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 15:37:37 +0000 (11:37 -0400)]
convert reiserfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert btrfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 15:35:40 +0000 (11:35 -0400)]
convert btrfs to ->evict_inode()

NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch gfs2 to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 15:05:19 +0000 (11:05 -0400)]
switch gfs2 to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert ocfs2 to ->evict_inode()
Al Viro [Wed, 9 Jun 2010 01:28:10 +0000 (21:28 -0400)]
convert ocfs2 to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch ncpfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:45:56 +0000 (00:45 -0400)]
switch ncpfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch udf to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:43:39 +0000 (00:43 -0400)]
switch udf to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch ubifs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:34:05 +0000 (00:34 -0400)]
switch ubifs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch jfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:28:54 +0000 (00:28 -0400)]
switch jfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch hpfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:18:40 +0000 (00:18 -0400)]
switch hpfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch hppfs to ->evict_inode()
Al Viro [Mon, 7 Jun 2010 04:12:50 +0000 (00:12 -0400)]
switch hppfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agotry to get rid of races in hostfs open()
Al Viro [Mon, 7 Jun 2010 03:49:18 +0000 (23:49 -0400)]
try to get rid of races in hostfs open()

In case of mode mismatch, do *not* blindly close the descriptor
another openers might be using right now.  Open the underlying
file with currently sufficient mode, then
* if current mode has grown so that it's sufficient for
us now, just close our new fd
* if current mode has grown and our fd is *not* enough
to cover it, close and repeat.
* otherwise, install our fd if the file hadn't been
opened at all or dup2() our fd over the current one (and close
our fd).
Critical section is protected by mutex; yes, system-wide.  All
we do under it is a bunch of comparison and maybe an overwriting
dup2() on host.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoleak in hostfs_unlink()
Al Viro [Mon, 7 Jun 2010 03:19:04 +0000 (23:19 -0400)]
leak in hostfs_unlink()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: fix races in dentry_name() and inode_name()
Al Viro [Mon, 7 Jun 2010 03:16:34 +0000 (23:16 -0400)]
hostfs: fix races in dentry_name() and inode_name()

calculating size, then doing allocation, then filling the
path is a Bad Idea(tm), since the ancestors can be renamed,
leading to buffer overrun.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agonew helper: __dentry_path()
Al Viro [Mon, 7 Jun 2010 02:31:14 +0000 (22:31 -0400)]
new helper: __dentry_path()

builds path relative to fs root, called under dcache_lock,
doesn't append any nonsense to unlinked ones.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: sanitize symlinks
Al Viro [Mon, 7 Jun 2010 01:51:16 +0000 (21:51 -0400)]
hostfs: sanitize symlinks

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: get rid of inode_dentry_name()
Al Viro [Mon, 7 Jun 2010 00:42:10 +0000 (20:42 -0400)]
hostfs: get rid of inode_dentry_name()

it's equivalent to dentry_name() anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: get rid of file_type(), fold init_inode()
Al Viro [Mon, 7 Jun 2010 00:33:12 +0000 (20:33 -0400)]
hostfs: get rid of file_type(), fold init_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch stat_file() to passing a single struct rather than fsckloads of pointers
Al Viro [Mon, 7 Jun 2010 00:08:56 +0000 (20:08 -0400)]
switch stat_file() to passing a single struct rather than fsckloads of pointers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: pass pathname to init_inode()
Al Viro [Sun, 6 Jun 2010 23:38:18 +0000 (19:38 -0400)]
hostfs: pass pathname to init_inode()

We will calculate it in all callers anyway, so there's no
need to duplicate that inside.  Moreover, that way we lose
all failure exits in init_inode(), so it doesn't need to
return anything.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoget rid of hostfs_read_inode()
Al Viro [Sun, 6 Jun 2010 22:43:19 +0000 (18:43 -0400)]
get rid of hostfs_read_inode()

There are only two call sites; in one (hostfs_iget()) it's actually
a no-op and in another (fill_super()) it's easier to expand the
damn thing and use what we know about its arguments to simplify
it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agohostfs: don't keep a field in each inode when we are using it only in root
Al Viro [Sun, 6 Jun 2010 21:53:01 +0000 (17:53 -0400)]
hostfs: don't keep a field in each inode when we are using it only in root

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agostop icache pollution in hostfs, switch to ->evict_inode()
Al Viro [Sun, 6 Jun 2010 19:16:17 +0000 (15:16 -0400)]
stop icache pollution in hostfs, switch to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch affs to ->evict_inode()
Al Viro [Sun, 6 Jun 2010 14:16:41 +0000 (10:16 -0400)]
switch affs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch omfs to ->evict_inode()
Al Viro [Sun, 6 Jun 2010 14:12:01 +0000 (10:12 -0400)]
switch omfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch bfs to ->evict_inode(), clean up
Al Viro [Sun, 6 Jun 2010 13:50:39 +0000 (09:50 -0400)]
switch bfs to ->evict_inode(), clean up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoconvert ext3 to ->evict_inode()
Al Viro [Sun, 6 Jun 2010 11:08:19 +0000 (07:08 -0400)]
convert ext3 to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agospufs conversion to ->evict_inode()
Al Viro [Sun, 6 Jun 2010 01:20:32 +0000 (21:20 -0400)]
spufs conversion to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch ufs to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 23:40:56 +0000 (19:40 -0400)]
switch ufs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agocovert fatfs to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 23:28:32 +0000 (19:28 -0400)]
covert fatfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch smbfs to evict_inode()
Al Viro [Sat, 5 Jun 2010 23:22:50 +0000 (19:22 -0400)]
switch smbfs to evict_inode()

NB: treatment of inode hash is completely braindead there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch sysv to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 23:16:20 +0000 (19:16 -0400)]
switch sysv to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch shmem.c to ->evice_inode()
Al Viro [Sat, 5 Jun 2010 23:10:41 +0000 (19:10 -0400)]
switch shmem.c to ->evice_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch mqueue to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 20:29:45 +0000 (16:29 -0400)]
switch mqueue to ->evict_inode()

... and since the inodes are never hashed, we can use default ->drop_inode()
just fine.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agomerge ext2 delete_inode and clear_inode, switch to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 03:32:28 +0000 (23:32 -0400)]
merge ext2 delete_inode and clear_inode, switch to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoDon't dirty the victim in ext2_xattr_delete_inode()
Al Viro [Wed, 21 Jul 2010 21:22:47 +0000 (01:22 +0400)]
Don't dirty the victim in ext2_xattr_delete_inode()

... it's beyond fs-writeback reach already - writeback won't
be started at that point.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoTake dirtying the inode to callers of ext2_free_blocks()
Al Viro [Wed, 21 Jul 2010 21:19:42 +0000 (01:19 +0400)]
Take dirtying the inode to callers of ext2_free_blocks()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoext2: switch to dquot_free_block_nodirty()
Al Viro [Wed, 21 Jul 2010 21:13:36 +0000 (01:13 +0400)]
ext2: switch to dquot_free_block_nodirty()

brute-force conversion

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch minix to ->evict_inode(), fix write_inode/delete_inode race
Al Viro [Sat, 5 Jun 2010 02:27:38 +0000 (22:27 -0400)]
switch minix to ->evict_inode(), fix write_inode/delete_inode race

We need to wait for completion of possible writeback in progress
before we clear on-disk inode during deletion.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch sysfs to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 02:21:54 +0000 (22:21 -0400)]
switch sysfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch procfs to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 02:17:56 +0000 (22:17 -0400)]
switch procfs to ->evict_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agosimplify get_cramfs_inode()
Al Viro [Sat, 5 Jun 2010 01:19:01 +0000 (21:19 -0400)]
simplify get_cramfs_inode()

simply don't hash the inodes that don't have real inumber instead of
skipping them during iget5_locked(); as the result, simple iget_locked()
would do and we can get rid of cramfs ->drop_inode() as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoswitch hypfs to ->evict_inode()
Al Viro [Sat, 5 Jun 2010 01:02:59 +0000 (21:02 -0400)]
switch hypfs to ->evict_inode()

... and since we never hash its inodes, default
->drop_inode() will work just fine.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agonew helper: end_writeback()
Al Viro [Sat, 5 Jun 2010 00:55:25 +0000 (20:55 -0400)]
new helper: end_writeback()

Essentially, the minimal variant of ->evict_inode().  It's
a trimmed-down clear_inode(), sans any fs callbacks.  Once
it returns we know that no async writeback will be happening;
every ->evict_inode() instance should do that once and do that
before doing anything ->write_inode() could interfere with
(e.g. freeing the on-disk inode).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
14 years agoTake ->i_bdev/->i_cdev handling out of clear_inode()
Al Viro [Sat, 5 Jun 2010 00:19:55 +0000 (20:19 -0400)]
Take ->i_bdev/->i_cdev handling out of clear_inode()

All call chains to clear_inode() pass through evict_inode() and
clear_inode() should be called by evict_inode() exactly once.
So we can pull i_bdev/i_cdev detaching up to evict_inode() itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>