platform/upstream/kernel-adaptation-pc.git
11 years agomm: do not use page_count() without a page pin
Minchan Kim [Tue, 31 Jul 2012 23:42:59 +0000 (16:42 -0700)]
mm: do not use page_count() without a page pin

d179e84ba ("mm: vmscan: do not use page_count without a page pin") fixed
this problem in vmscan.c but same problem is in __count_immobile_pages().

I copy and paste d179e84ba's contents for description.

"It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU."

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wanpeng Li <liwp.linux@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm, oom: replace some information in tasklist dump
David Rientjes [Tue, 31 Jul 2012 23:42:56 +0000 (16:42 -0700)]
mm, oom: replace some information in tasklist dump

The number of ptes and swap entries are used in the oom killer's badness
heuristic, so they should be shown in the tasklist dump.

This patch adds those fields and replaces cpu and oom_adj values that are
currently emitted.  Cpu isn't interesting and oom_adj is deprecated and
will be removed later this year, the same information is already displayed
as oom_score_adj which is used internally.

At the same time, make the documentation a little more clear to state this
information is helpful to determine why the oom killer chose the task it
did to kill.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm, oom: fix potential killing of thread that is disabled from oom killing
David Rientjes [Tue, 31 Jul 2012 23:42:55 +0000 (16:42 -0700)]
mm, oom: fix potential killing of thread that is disabled from oom killing

/proc/sys/vm/oom_kill_allocating_task will immediately kill current when
the oom killer is called to avoid a potentially expensive tasklist scan
for large systems.

Currently, however, it is not checking current's oom_score_adj value which
may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
killing.

This patch avoids killing current in such a condition and simply falls
back to the tasklist scan since memory still needs to be freed.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator...
KOSAKI Motohiro [Tue, 31 Jul 2012 23:42:53 +0000 (16:42 -0700)]
mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again

commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds
pages to the buddy allocator again") fixed one free_pcppages_bulk()
misuse.  But two another miuse still exist.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()
KOSAKI Motohiro [Tue, 31 Jul 2012 23:42:50 +0000 (16:42 -0700)]
mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()

Eric Wong reported his test suite failex when /tmp is tmpfs.

https://lkml.org/lkml/2012/2/24/479

Currentlt the input check of POSIX_FADV_WILLNEED has two problems.

- requires a_ops->readpage.  But in fact, force_page_cache_readahead()
  requires that the target filesystem has either ->readpage or ->readpages.

- returns -EINVAL when the filesystem doesn't have ->readpage.  But
  posix says that fadvise is merely a hint.  Thus fadvise() should return
  0 if filesystem has no means of implementing fadvise().  The userland
  application should not know nor care whcih type of filesystem backs the
  TMPDIR directory, as Eric pointed out.  There is nothing which userspace
  can do to solve this error.

So change the return value to 0 when filesytem doesn't support readahead.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Wong <normalperson@yhbt.net>
Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm/compaction: cleanup on compaction_deferred
Gavin Shan [Tue, 31 Jul 2012 23:42:49 +0000 (16:42 -0700)]
mm/compaction: cleanup on compaction_deferred

When CONFIG_COMPACTION is enabled, compaction_deferred() tries to
recalculate the deferred limit again, which isn't necessary.

When CONFIG_COMPACTION is disabled, compaction_deferred() should return
"true" or "false" since it has "bool" for its return value.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: make mem_cgroup_force_empty_list() return bool
KAMEZAWA Hiroyuki [Tue, 31 Jul 2012 23:42:46 +0000 (16:42 -0700)]
memcg: make mem_cgroup_force_empty_list() return bool

mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY
indicates 'you need to retry'.  Make mem_cgroup_force_empty_list() return
a bool to simplify the logic.

[akpm@linux-foundation.org: rework mem_cgroup_force_empty_list()'s comment]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: mem_cgroup_move_parent() doesn't need gfp_mask
KAMEZAWA Hiroyuki [Tue, 31 Jul 2012 23:42:45 +0000 (16:42 -0700)]
memcg: mem_cgroup_move_parent() doesn't need gfp_mask

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: clean up force_empty_list() return value check
Kamezawa Hiroyuki [Tue, 31 Jul 2012 23:42:44 +0000 (16:42 -0700)]
memcg: clean up force_empty_list() return value check

After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()"
mem_cgroup_move_parent() returns only -EBUSY or -EINVAL.  So we can remove
the -ENOMEM and -EINTR checks.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: remove check for signal_pending() during rmdir()
Kamezawa Hiroyuki [Tue, 31 Jul 2012 23:42:42 +0000 (16:42 -0700)]
memcg: remove check for signal_pending() during rmdir()

After bf544fdc241da8 "memcg: move charges to root cgroup if
use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim
will occur when removing a memory cgroup.  If -EINTR is returned here,
cgroup will show a warning.

We don't need to handle any user interruption signal.  Remove this.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm/memblock.c:memblock_double_array(): cosmetic cleanups
Andrew Morton [Tue, 31 Jul 2012 23:42:40 +0000 (16:42 -0700)]
mm/memblock.c:memblock_double_array(): cosmetic cleanups

This function is an 80-column eyesore, quite unnecessarily.  Clean that
up, and use standard comment layout style.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm, oom: do not schedule if current has been killed
David Rientjes [Tue, 31 Jul 2012 23:42:37 +0000 (16:42 -0700)]
mm, oom: do not schedule if current has been killed

The oom killer currently schedules away from current in an uninterruptible
sleep if it does not have access to memory reserves.  It's possible that
current was killed because it shares memory with the oom killed thread or
because it was killed by the user in the interim, however.

This patch only schedules away from current if it does not have a pending
kill, i.e.  if it does not share memory with the oom killed thread.  It's
possible that it will immediately retry its memory allocation and fail,
but it will immediately be given access to memory reserves if it calls the
oom killer again.

This prevents the delay of memory freeing when threads that share memory
with the oom killed thread get unnecessarily scheduled.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:36 +0000 (16:42 -0700)]
hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate

We already hold the hugetlb_lock.  That should prevent a parallel cgroup
rmdir from touching page's hugetlb cgroup.  So remove the exclude and
wakeup calls.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:35 +0000 (16:42 -0700)]
hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.

A page's hugetlb cgroup assignment and movement to the active list should
occur with hugetlb_lock held.  Otherwise when we remove the hugetlb cgroup
we will iterate the active list and find pages with NULL hugetlb cgroup
values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: move all the in use pages to active list
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:32 +0000 (16:42 -0700)]
hugetlb: move all the in use pages to active list

When we fail to allocate pages from the reserve pool, hugetlb tries to
allocate huge pages using alloc_buddy_huge_page.  Add these to the active
list.  We also need to add the huge page we allocate when we soft offline
the oldpage to active list.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: add HugeTLB controller documentation
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:30 +0000 (16:42 -0700)]
hugetlb/cgroup: add HugeTLB controller documentation

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:27 +0000 (16:42 -0700)]
hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

With HugeTLB pages, hugetlb cgroup is uncharged in compound page
destructor.  Since we are holding a hugepage reference, we can be sure
that old page won't get uncharged till the last put_page().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: add hugetlb cgroup control files
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:24 +0000 (16:42 -0700)]
hugetlb/cgroup: add hugetlb cgroup control files

Add the control files for hugetlb controller

[akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
[akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: add support for cgroup removal
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:21 +0000 (16:42 -0700)]
hugetlb/cgroup: add support for cgroup removal

Add support for cgroup removal.  If we don't have parent cgroup, the
charges are moved to root cgroup.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:18 +0000 (16:42 -0700)]
hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup

Add the charge and uncharge routines for hugetlb cgroup.  We do cgroup
charging in page alloc and uncharge in compound page destructor.
Assigning page's hugetlb cgroup is protected by hugetlb_lock.

[liwp@linux.vnet.ibm.com: add huge_page_order check to avoid incorrect uncharge]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb/cgroup: add the cgroup pointer to page lru
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:15 +0000 (16:42 -0700)]
hugetlb/cgroup: add the cgroup pointer to page lru

Add the hugetlb cgroup pointer to 3rd page lru.next.  This limit the usage
to hugetlb cgroup to only hugepages with 3 or more normal pages.  I guess
that is an acceptable limitation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm/hugetlb: add new HugeTLB cgroup
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:12 +0000 (16:42 -0700)]
mm/hugetlb: add new HugeTLB cgroup

Implement a new controller that allows us to control HugeTLB allocations.
The extension allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault.  Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit.  This requires the application to know beforehand how
much HugeTLB pages it would require for its use.

The charge/uncharge calls will be added to HugeTLB code in later patch.
Support for cgroup removal will be added in later patches.

[akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
[akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g]
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: make some static variables global
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:10 +0000 (16:42 -0700)]
hugetlb: make some static variables global

We will use them later in hugetlb_cgroup.c

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: add a list for tracking in-use HugeTLB pages
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:07 +0000 (16:42 -0700)]
hugetlb: add a list for tracking in-use HugeTLB pages

hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
On cgroup removal we update the page's HugeTLB cgroup to point to parent
cgroup.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: simplify migrate_huge_page()
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:06 +0000 (16:42 -0700)]
hugetlb: simplify migrate_huge_page()

Since we migrate only one hugepage, don't use linked list for passing the
page around.  Directly pass the page that need to be migrated as argument.
This also removes the usage of page->lru in the migrate path.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: use mmu_gather instead of a temporary linked list for accumulating pages
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:03 +0000 (16:42 -0700)]
hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

Use a mmu_gather instead of a temporary linked list for accumulating pages
when we unmap a hugepage range

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: add an inline helper for finding hstate index
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:42:00 +0000 (16:42 -0700)]
hugetlb: add an inline helper for finding hstate index

Add an inline helper and use it in the code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: don't use ERR_PTR with VM_FAULT* values
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:41:57 +0000 (16:41 -0700)]
hugetlb: don't use ERR_PTR with VM_FAULT* values

The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
VM_FAULT_* values will not exceed MAX_ERRNO value.  Decouple the
VM_FAULT_* values from MAX_ERRNO.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agohugetlb: rename max_hstate to hugetlb_max_hstate
Aneesh Kumar K.V [Tue, 31 Jul 2012 23:41:54 +0000 (16:41 -0700)]
hugetlb: rename max_hstate to hugetlb_max_hstate

This patchset implements a cgroup resource controller for HugeTLB pages.
The controller allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault.  Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit.  This requires the application to know beforehand how
much HugeTLB pages it would require for its use.

The goal is to control how many HugeTLB pages a group of task can
allocate.  It can be looked at as an extension of the existing quota
interface which limits the number of HugeTLB pages per hugetlbfs
superblock.  HPC job scheduler requires jobs to specify their resource
requirements in the job file.  Once their requirements can be met, job
schedulers like (SLURM) will schedule the job.  We need to make sure that
the jobs won't consume more resources than requested.  If they do we
should either error out or kill the application.

This patch:

Rename max_hstate to hugetlb_max_hstate.  We will be using this from other
subsystems like hugetlb controller in later patches.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads
Wanpeng Li [Tue, 31 Jul 2012 23:41:52 +0000 (16:41 -0700)]
mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads

Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more.  But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.

Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm/buddy: cleanup on should_fail_alloc_page
Gavin Shan [Tue, 31 Jul 2012 23:41:51 +0000 (16:41 -0700)]
mm/buddy: cleanup on should_fail_alloc_page

Currently, function should_fail() has "bool" for its return value, so it's
reasonable to change the return value of function should_fail_alloc_page()
into "bool" as well.

The patch does cleanup on function should_fail_alloc_page() to have "bool"
for its return value.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm: account the total_vm in the vm_stat_account()
Huang Shijie [Tue, 31 Jul 2012 23:41:49 +0000 (16:41 -0700)]
mm: account the total_vm in the vm_stat_account()

vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agodocumentation: update how page-cluster affects swap I/O
Christian Ehrhardt [Tue, 31 Jul 2012 23:41:46 +0000 (16:41 -0700)]
documentation: update how page-cluster affects swap I/O

Fix of the documentation of /proc/sys/vm/page-cluster to match the
behavior of the code and add some comments about what the tunable will
change in that behavior.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agoswap: allow swap readahead to be merged
Christian Ehrhardt [Tue, 31 Jul 2012 23:41:44 +0000 (16:41 -0700)]
swap: allow swap readahead to be merged

Swap readahead works fine, but the I/O to disk is almost always done in
page size requests, despite the fact that readahead submits
1<<page-cluster pages at a time.

On older kernels the old per device plugging behavior might have captured
this and merged the requests, but currently all comes down to much more
I/Os than required.

On a single device this might not be an issue, but as soon as a server
runs on shared san resources savin I/Os not only improves swapin
throughput but also provides a lower resource utilization.

With a load running KVM in a lot of memory overcommitment (the hot memory
is 1.5 times the host memory) swapping throughput improves significantly
and the lead feels more responsive as well as achieves more throughput.

In a test setup with 16 swap disks running blocktrace on one of those disks
shows the improved merging:
Prior:
Reads Queued:     560,888,    2,243MiB  Writes Queued:     226,242,  904,968KiB
Read Dispatches:  544,701,    2,243MiB  Write Dispatches:  159,318,  904,968KiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  544,716,    2,243MiB  Writes Completed:  159,321,  904,980KiB
Read Merges:       16,187,   64,748KiB  Write Merges:       61,744,  246,976KiB
IO unplugs:       149,614               Timer unplugs:       2,940

With the patch:
Reads Queued:     734,315,    2,937MiB  Writes Queued:     300,188,    1,200MiB
Read Dispatches:  214,972,    2,937MiB  Write Dispatches:  215,176,    1,200MiB
Reads Requeued:         0               Writes Requeued:         0
Reads Completed:  214,971,    2,937MiB  Writes Completed:  215,177,    1,200MiB
Read Merges:      519,343,    2,077MiB  Write Merges:       73,325,  293,300KiB
IO unplugs:       337,130               Timer unplugs:      11,184

I got ~10% to ~40% more throughput in my cases and at the same time much
lower cpu consumption when broken down per transferred kilobyte (the
majority of that due to saved interrupts and better cache handling).  In a
shared SAN others might get an additional benefit as well, because this
now causes less protocol overhead.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: remove MEM_CGROUP_CHARGE_TYPE_FORCE
Kamezawa Hiroyuki [Tue, 31 Jul 2012 23:41:41 +0000 (16:41 -0700)]
memcg: remove MEM_CGROUP_CHARGE_TYPE_FORCE

There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE").

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANON
Kamezawa Hiroyuki [Tue, 31 Jul 2012 23:41:40 +0000 (16:41 -0700)]
memcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANON

Now, in memcg, 2 "MAPPED" enum/macro are found
 MEM_CGROUP_CHARGE_TYPE_MAPPED
 MEM_CGROUP_STAT_FILE_MAPPED

Thier names looks similar to each other but the former is used for
accounting anonymous memory. rename it as TYPE_ANON.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomemcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAP
Kamezawa Hiroyuki [Tue, 31 Jul 2012 23:41:38 +0000 (16:41 -0700)]
memcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAP

MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than
the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agomm: make vb_alloc() more foolproof
Jan Kara [Tue, 31 Jul 2012 23:41:37 +0000 (16:41 -0700)]
mm: make vb_alloc() more foolproof

If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate
0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT
and interesting stuff happens.  So make debugging such problems easier and
warn about 0-size allocation.

[akpm@linux-foundation.org: use WARN_ON-return-value feature]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agovmalloc: walk vmap_areas by sorted list instead of rb_next()
Hong zhi guo [Tue, 31 Jul 2012 23:41:35 +0000 (16:41 -0700)]
vmalloc: walk vmap_areas by sorted list instead of rb_next()

There's a walk by repeating rb_next to find a suitable hole.  Could be
simply replaced by walk on the sorted vmap_area_list.  More simpler and
efficient.

Mutation of the list and tree only happens in pair within
__insert_vmap_area and __free_vmap_area, under protection of
vmap_area_lock.  The patch code is also under vmap_area_lock, so the list
walk is safe, and consistent with the tree walk.

Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and
rounds for hours.

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agodrivers/media/video/v4l2-ioctl.c: fix build
Andrew Morton [Tue, 31 Jul 2012 23:41:34 +0000 (16:41 -0700)]
drivers/media/video/v4l2-ioctl.c: fix build

Fix zillions of these:

drivers/media/video/v4l2-ioctl.c:1848: error: unknown field 'func' specified in initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: missing braces around initializer
drivers/media/video/v4l2-ioctl.c:1848: warning: (near initialization for 'v4l2_ioctls[0].<anonymous>')
drivers/media/video/v4l2-ioctl.c:1848: warning: initialization makes integer from pointer without a cast
drivers/media/video/v4l2-ioctl.c:1848: error: initializer element is not computable at load time
drivers/media/video/v4l2-ioctl.c:1848: error: (near initialization for 'v4l2_ioctls[0].<anonymous>.offset')

Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agoxtensa: select generic atomic64_t support
Fengguang Wu [Tue, 31 Jul 2012 23:41:33 +0000 (16:41 -0700)]
xtensa: select generic atomic64_t support

This will fix build errors:

block/blk-cgroup.c:609:2: error: unknown type name 'atomic64_t'
block/blk-cgroup.c:609:2: error: implicit declaration of function 'ATOMIC64_INIT' [-Werror=implicit-function-declaration]

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agofault-injection: fix failcmd.sh warning
Akinobu Mita [Tue, 31 Jul 2012 23:41:31 +0000 (16:41 -0700)]
fault-injection: fix failcmd.sh warning

"fault-injection: add tool to run command with failslab or
fail_page_alloc" added tools/testing/fault-injection/failcmd.sh to make it
easier to inject slab/page allocation failures by fault injection.

failcmd.sh prints the following warning when running with arguments
for command.

# ./failcmd.sh echo aaa
failcmd.sh: line 209: [: echo: binary operator expected
aaa

This warning is caused by an improper check whether at least one
parameter is left after parsing command options.

Fix it by testing the length of $1 instead of $@

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 years agoMerge tag 'for-v3.6' of git://git.infradead.org/battery-2.6
Linus Torvalds [Wed, 1 Aug 2012 01:08:25 +0000 (18:08 -0700)]
Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6

Pull battery updates from Anton Vorontsov:
 "The tag contains just a few battery-related changes for v3.6.  It's is
  all pretty straightforward, except one thing.

  One of our patches added thermal support for power supply class, but
  thermal/ subsystem changed under our feet.  We (well, Stephen, that
  is) caught the issue and it was decided[1] that I'd just delay the
  battery pull request, and then will fix it up by merging upstream back
  into battery tree at the specific commit.

  That's not all though: another[2] small fixup for thermal subsystem
  was needed to get rid of a warning in power supply subsystem (the
  warning was not drivers/power's "fault", the thermal registration
  function just needed a proper const annotation, which is also done by
  a small commit on top of the merge.

  So, to sum this up:
   - The 'master' branch of the battery tree was in the -next tree for
     weeks, was never rebased, altered etc.  It should be all OK;
   - Although, for-v3.6 tag contains the 'master' branch + merge + the
     warning fix.

  [1] http://lkml.org/lkml/2012/6/19/23
  [2] http://lkml.org/lkml/2012/6/18/28"

* tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits)
  thermal: Constify 'type' argument for the registration routine
  olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries
  olpc-battery: Add VOLTAGE_MAX_DESIGN property
  charger-manager: Fix build break related to EXTCON
  lp8727_charger: Move header file into platform_data directory
  power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT
  bq27x00_battery: Add support for BQ27425 chip
  charger-manager: Set current limit of regulator for over current protection
  charger-manager: Use EXTCON Subsystem to detect charger cables for charging
  test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties
  test_power: Add support for USB AC source
  gpio-charger: Use cansleep version of gpio_set_value
  bq27x00_battery: Add support for power average and health properties
  sbs-battery: Don't trigger false supply_changed event
  twl4030_charger: Allow charger to control the regulator that feeds it
  twl4030_charger: Add backup-battery charging
  twl4030_charger: Fix some typos
  max17042_battery: Support CHARGE_COUNTER power supply attribute
  smb347-charger: Add constant charge and current properties
  power_supply: Add constant charge_current and charge_voltage properties
  ...

11 years agoMerge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 31 Jul 2012 22:34:13 +0000 (15:34 -0700)]
Merge branch 'perf-core-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
 "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
  updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
  fixes) now that Arnaldo is back from vacation."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  uprobes: __replace_page() needs munlock_vma_page()
  uprobes: Rename vma_address() and make it return "unsigned long"
  uprobes: Fix register_for_each_vma()->vma_address() check
  uprobes: Introduce vaddr_to_offset(vma, vaddr)
  uprobes: Teach build_probe_list() to consider the range
  uprobes: Remove insert_vm_struct()->uprobe_mmap()
  uprobes: Remove copy_vma()->uprobe_mmap()
  uprobes: Fix overflow in vma_address()/find_active_uprobe()
  uprobes: Suppress uprobe_munmap() from mmput()
  uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
  uprobes: Clean up and document write_opcode()->lock_page(old_page)
  uprobes: Kill write_opcode()->lock_page(new_page)
  uprobes: __replace_page() should not use page_address_in_vma()
  uprobes: Don't recheck vma/f_mapping in write_opcode()
  perf/x86: Fix missing struct before structure name
  perf/x86: Fix format definition of SNB-EP uncore QPI box
  perf/x86: Make bitfield unsigned
  perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
  perf/x86: Add Intel Nehalem-EX uncore support
  perf/x86: Fix typo in format definition of uncore PCU filter
  ...

11 years agoMerge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Linus Torvalds [Tue, 31 Jul 2012 22:33:04 +0000 (15:33 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc

Pull powerpc updates from Benjamin Herrenschmidt:
 "Kumar sent me a handful of Freescale related fixes and I added another
  regression fix to the pile.

  PS.  I -will- eventually learn about that signed tag business :-)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/kvm/book3s_32: Fix MTMSR_EERI macro
  powerpc/85xx: p1022ds: fix DIU/LBC switching with NAND enabled
  powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled
  powerpc/85xx: Fix sram_offset parameter type
  powerpc/85xx: P3041DS - change espi input-clock from 40MHz to 35MHz
  powerpc/85xx: Fix pci base address error for p2020rdb-pc in dts

11 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Tue, 31 Jul 2012 22:32:05 +0000 (15:32 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux

Pull s390 updates from Martin Schwidefsky:
 "This it the second batch of s390 patches for the 3.6 merge window.
  Included is enablement for two common code changes, killable page
  faults and sorted exception tables.  And the regular set of cleanup
  and bug fix patches."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390: make use of user_mode() macro where possible
  s390/mm: rename user_mode variable to addressing_mode
  s390/mm: fix fault handling for page table walk case
  s390/mm: make page faults killable
  s390: update defconfig
  s390/mm: downgrade page table after fork of a 31 bit process
  s390/ipl: Use diagnose 8 command separation
  s390/linker script: use RO_DATA_SECTION
  s390/exceptions: sort exception table at build time
  s390/debug: remove module_exit function / move EXPORT_SYMBOLs

11 years agoipv4: Properly purge netdev references on uncached routes.
David S. Miller [Tue, 31 Jul 2012 22:06:50 +0000 (15:06 -0700)]
ipv4: Properly purge netdev references on uncached routes.

When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.

If a route is uncached, we currently have no way of accomplishing
this.

So create a global list that is scanned when a network device goes
down.  This mirrors the logic in net/core/dst.c's dst_ifdown().

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoipv4: Cache routes in nexthop exception entries.
David S. Miller [Tue, 31 Jul 2012 22:02:02 +0000 (15:02 -0700)]
ipv4: Cache routes in nexthop exception entries.

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Linus Torvalds [Tue, 31 Jul 2012 21:42:28 +0000 (14:42 -0700)]
Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux

Pull nfsd changes from J. Bruce Fields:
 "This has been an unusually quiet cycle--mostly bugfixes and cleanup.
  The one large piece is Stanislav's work to containerize the server's
  grace period--but that in itself is just one more step in a
  not-yet-complete project to allow fully containerized nfs service.

  There are a number of outstanding delegation, container, v4 state, and
  gss patches that aren't quite ready yet; 3.7 may be wilder."

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
  NFSd: make boot_time variable per network namespace
  NFSd: make grace end flag per network namespace
  Lockd: move grace period management from lockd() to per-net functions
  LockD: pass actual network namespace to grace period management functions
  LockD: manage grace list per network namespace
  SUNRPC: service request network namespace helper introduced
  NFSd: make nfsd4_manager allocated per network namespace context.
  LockD: make lockd manager allocated per network namespace
  LockD: manage grace period per network namespace
  Lockd: add more debug to host shutdown functions
  Lockd: host complaining function introduced
  LockD: manage used host count per networks namespace
  LockD: manage garbage collection timeout per networks namespace
  LockD: make garbage collector network namespace aware.
  LockD: mark host per network namespace on garbage collect
  nfsd4: fix missing fault_inject.h include
  locks: move lease-specific code out of locks_delete_lock
  locks: prevent side-effects of locks_release_private before file_lock is initialized
  NFSd: set nfsd_serv to NULL after service destruction
  NFSd: introduce nfsd_destroy() helper
  ...

11 years agoipv4: percpu nh_rth_output cache
Eric Dumazet [Tue, 31 Jul 2012 05:45:30 +0000 (05:45 +0000)]
ipv4: percpu nh_rth_output cache

Input path is mostly run under RCU and doesnt touch dst refcnt

But output path on forwarding or UDP workloads hits
badly dst refcount, and we have lot of false sharing, for example
in ipv4_mtu() when reading rt->rt_pmtu

Using a percpu cache for nh_rth_output gives a nice performance
increase at a small cost.

24 udpflood test on my 24 cpu machine (dummy0 output device)
(each process sends 1.000.000 udp frames, 24 processes are started)

before : 5.24 s
after : 2.06 s
For reference, time on linux-3.5 : 6.60 s

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoipv4: Restore old dst_free() behavior.
Eric Dumazet [Tue, 31 Jul 2012 01:08:23 +0000 (01:08 +0000)]
ipv4: Restore old dst_free() behavior.

commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
to solve a race but added a problem at device/fib dismantle time :

We really want to call dst_free() as soon as possible, even if sockets
still have dst in their cache.
dst_release() calls in free_fib_info_rcu() are not welcomed.

Root of the problem was that now we also cache output routes (in
nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
rt_free(), because output route lookups are done in process context.

Based on feedback and initial patch from David Miller (adding another
call_rcu_bh() call in fib, but it appears it was not the right fix)

I left the inet_sk_rx_dst_set() helper and added __rcu attributes
to nh_rth_output and nh_rth_input to better document what is going on in
this code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph...
Linus Torvalds [Tue, 31 Jul 2012 21:35:28 +0000 (14:35 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client

Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...

11 years ago[media] radio-tea5777: use library for 64bits div
Mauro Carvalho Chehab [Tue, 31 Jul 2012 19:38:41 +0000 (16:38 -0300)]
[media] radio-tea5777: use library for 64bits div

drivers/built-in.o: In function `radio_tea5777_set_freq':
radio-tea5777.c:(.text+0x4d8704): undefined reference to `__udivdi3'

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Hans de Goede <hdegoede@redhat.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years agonfs: explicitly reject LOCK_MAND flock() requests
Jeff Layton [Mon, 23 Jul 2012 19:46:23 +0000 (15:46 -0400)]
nfs: explicitly reject LOCK_MAND flock() requests

We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly
return -EINVAL if someone requests it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
11 years agonfs: increase number of permitted callback connections.
NeilBrown [Tue, 31 Jul 2012 04:40:12 +0000 (14:40 +1000)]
nfs: increase number of permitted callback connections.

By default a sunrpc service is limited to (N+3)*20 connections
where N is the number of threads.  This is 80 when N==1.
If this number is exceeded a warning is printed suggesting that
the number of threads be increased.  However with services which
run a single thread, this is impossible.

For such services there is a ->sv_maxconn setting that can be
used to forcibly increase the limit, and silence the message.
This is used by lockd.

The nfs client uses a sunrpc service to handle callbacks and
it too is single-threaded, so to avoid the useless messages,
and to allow a reasonable number of concurrent connections,
we need to set ->sv_maxconn.  1024 seems like a good number.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
11 years agovfio: Add PCI device driver
Alex Williamson [Tue, 31 Jul 2012 14:16:24 +0000 (08:16 -0600)]
vfio: Add PCI device driver

Add PCI device support for VFIO.  PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device.  PCI config access is virtualized in the kernel,
allowing us to ensure the integrity of the system, by preventing
various accesses while reducing duplicate support across various
userspace drivers.  I/O port supports read/write access while
MMIO also supports mmap of sufficiently sized regions.  Support
for INTx, MSI, and MSI-X interrupts are provided using eventfds to
userspace.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
11 years agovfio: Type1 IOMMU implementation
Alex Williamson [Tue, 31 Jul 2012 14:16:23 +0000 (08:16 -0600)]
vfio: Type1 IOMMU implementation

This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel
VT-d hardware, but is potentially usable by anything supporting
similar mapping functionality.  We arbitrarily call this a Type1
backend for lack of a better name.  This backend has no IOVA
or host memory mapping restrictions for the user and is optimized
for relatively static mappings.  Mapped areas are pinned into system
memory.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
11 years agovfio: Add documentation
Alex Williamson [Tue, 31 Jul 2012 14:16:23 +0000 (08:16 -0600)]
vfio: Add documentation

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
11 years agovfio: VFIO core
Alex Williamson [Tue, 31 Jul 2012 14:16:22 +0000 (08:16 -0600)]
vfio: VFIO core

VFIO is a secure user level driver for use with both virtual machines
and user level drivers.  VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access.  It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).

New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface.  We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model.  IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms.  VFIO
users are able to query and initialize the IOMMU model of their
choice.

Please see the follow-on Documentation commit for further description
and usage example.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
11 years agothermal: Constify 'type' argument for the registration routine
Anton Vorontsov [Tue, 31 Jul 2012 11:39:30 +0000 (04:39 -0700)]
thermal: Constify 'type' argument for the registration routine

thermal_zone_device_register() does not modify 'type' argument, so it is
safe to declare it as const. Otherwise, if we pass a const string, we are
getting the ugly warning:

CC drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function 'psy_register_thermal':
drivers/power/power_supply_core.c:204:6: warning: passing argument 1 of 'thermal_zone_device_register' discards 'const' qualifier from pointer target type [enabled by default]
include/linux/thermal.h:140:29: note: expected 'char *' but argument is of type 'const char *'

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jean Delvare <khali@linux-fr.org>
11 years agoMerge with upstream to accommodate with thermal changes
Anton Vorontsov [Tue, 31 Jul 2012 11:59:42 +0000 (04:59 -0700)]
Merge ... upstream to accommodate with thermal changes

This merge is performed to take commit c56f5c0342dfee11a1 ("Thermal: Make
Thermal trip points writeable") out of Linus' tree and then fixup power
supply class. This is needed since thermal stuff added a new argument:

  CC      drivers/power/power_supply_core.o
drivers/power/power_supply_core.c: In function ‘psy_register_thermal’:
drivers/power/power_supply_core.c:204:6: warning: passing argument 3 of ‘thermal_zone_device_register’ makes integer from pointer without a cast [enabled by default]
include/linux/thermal.h:154:29: note: expected ‘int’ but argument is of type ‘struct power_supply *’
drivers/power/power_supply_core.c:204:6: error: too few arguments to function ‘thermal_zone_device_register’
include/linux/thermal.h:154:29: note: declared here
make[1]: *** [drivers/power/power_supply_core.o] Error 1
make: *** [drivers/power/] Error 2

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
11 years agoMerge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy...
Jens Axboe [Tue, 31 Jul 2012 09:47:36 +0000 (11:47 +0200)]
Merge branch 'upstream' of git://git./linux/kernel/git/jikos/floppy into for-3.6/drivers

11 years agofloppy: remove duplicated flag FD_RAW_NEED_DISK
Fengguang Wu [Sat, 28 Jul 2012 11:45:59 +0000 (19:45 +0800)]
floppy: remove duplicated flag FD_RAW_NEED_DISK

Fix coccinelle warning (without behavior change):

drivers/block/floppy.c:2518:32-48: duplicated argument to & or |

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
11 years agoALSA: hda - Fix mute-LED GPIO initialization for IDT codecs
Takashi Iwai [Tue, 31 Jul 2012 08:40:05 +0000 (10:40 +0200)]
ALSA: hda - Fix mute-LED GPIO initialization for IDT codecs

The IDT codecs initializes the GPIO setup for mute LEDs via
snd_hda_sync_vmaster_hook().  This works in most cases except for the
very first call, which is called before PCM and control creations.
Thus before Master switch is set manually via alsactl, the mute LED
may show the wrong state, depending on the polarity.

Now it's fixed by calling the LED-status update function manually when
no vmaster is set yet.

Cc: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
11 years agoALSA: hda - Add descriptions for missing IDT 92HD83x models
Takashi Iwai [Tue, 31 Jul 2012 08:26:34 +0000 (10:26 +0200)]
ALSA: hda - Add descriptions for missing IDT 92HD83x models

Signed-off-by: Takashi Iwai <tiwai@suse.de>
11 years agoALSA: hda - Fix polarity of mute LED on HP Mini 210
Takashi Iwai [Tue, 31 Jul 2012 08:16:59 +0000 (10:16 +0200)]
ALSA: hda - Fix polarity of mute LED on HP Mini 210

The commit a3e199732b made the LED working again on HP Mini 210 but
with a wrong polarity.  This patch fixes the polarity for this
machine, and also introduce a new model string "hp-inv-led".

Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=772923

Cc: <stable@vger.kernel.org> [v3.3+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
11 years agoblk: pass from_schedule to non-request unplug functions.
NeilBrown [Tue, 31 Jul 2012 07:08:15 +0000 (09:08 +0200)]
blk: pass from_schedule to non-request unplug functions.

This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 years agoblock: stack unplug
Shaohua Li [Tue, 31 Jul 2012 07:08:15 +0000 (09:08 +0200)]
block: stack unplug

MD raid1 prepares to dispatch request in unplug callback. If make_request in
low level queue also uses unplug callback to dispatch request, the low level
queue's unplug callback will not be called. Recheck the callback list helps
this case.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 years agoblk: centralize non-request unplug handling.
NeilBrown [Tue, 31 Jul 2012 07:08:14 +0000 (09:08 +0200)]
blk: centralize non-request unplug handling.

Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 years agomd: remove plug_cnt feature of plugging.
NeilBrown [Tue, 31 Jul 2012 07:08:14 +0000 (09:08 +0200)]
md: remove plug_cnt feature of plugging.

This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.

So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.

This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.

Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 years agoblock/nbd: micro-optimization in nbd request completion
Chetan Loke [Tue, 31 Jul 2012 06:47:13 +0000 (08:47 +0200)]
block/nbd: micro-optimization in nbd request completion

Add in-flight cmds to the tail. That way while searching
(during request completion),we will always get a hit on the
first element.

Signed-off-by: Chetan Loke <loke.chetan@gmail.com>
Acked-by: Paul.Clements@steeleye.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 years agoDocumentation: Correct s_umount state for freeze_fs/unfreeze_fs
Valerie Aurora [Tue, 12 Jun 2012 14:20:48 +0000 (16:20 +0200)]
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs

freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofs: Remove old freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:47 +0000 (16:20 +0200)]
fs: Remove old freezing mechanism

Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agoext2: Implement freezing
Jan Kara [Tue, 12 Jun 2012 14:20:46 +0000 (16:20 +0200)]
ext2: Implement freezing

The only missing piece to make freezing work reliably with ext2 is to
stop iput() of unlinked inode from deleting the inode on frozen filesystem.
So add a necessary protection to ext2_evict_inode().

We also provide appropriate ->freeze_fs and ->unfreeze_fs functions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agobtrfs: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:45 +0000 (16:20 +0200)]
btrfs: Convert to new freezing mechanism

We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agonilfs2: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:44 +0000 (16:20 +0200)]
nilfs2: Convert to new freezing mechanism

We change nilfs_page_mkwrite() to provide proper freeze protection for
writeable page faults (we must wait for frozen filesystem even if the
page is fully mapped).

We remove all vfs_check_frozen() checks since they are now handled by
the generic code.

CC: linux-nilfs@vger.kernel.org
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agontfs: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:43 +0000 (16:20 +0200)]
ntfs: Convert to new freezing mechanism

Move check in ntfs_file_aio_write_nolock() to ntfs_file_aio_write() and
use new freeze protection.

CC: linux-ntfs-dev@lists.sourceforge.net
CC: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofuse: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:42 +0000 (16:20 +0200)]
fuse: Convert to new freezing mechanism

Convert check in fuse_file_aio_write() to using new freeze protection.

CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agogfs2: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:41 +0000 (16:20 +0200)]
gfs2: Convert to new freezing mechanism

We update gfs2_page_mkwrite() to use new freeze protection and the transaction
code to use freeze protection while the transaction is running. That is needed
to stop iput() of unlinked file from modifying the filesystem. The rest is
handled by the generic code.

CC: cluster-devel@redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agoocfs2: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:40 +0000 (16:20 +0200)]
ocfs2: Convert to new freezing mechanism

Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze
protection. We also protect several ioctl entry points which were missing the
protection. Finally, we add freeze protection to the journaling mechanism so
that iput() of unlinked inode cannot modify a frozen filesystem.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agoxfs: Convert to new freezing code
Jan Kara [Tue, 12 Jun 2012 14:20:39 +0000 (16:20 +0200)]
xfs: Convert to new freezing code

Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agoext4: Convert to new freezing mechanism
Jan Kara [Tue, 12 Jun 2012 14:20:38 +0000 (16:20 +0200)]
ext4: Convert to new freezing mechanism

We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).

CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofs: Protect write paths by sb_start_write - sb_end_write
Jan Kara [Tue, 12 Jun 2012 14:20:37 +0000 (16:20 +0200)]
fs: Protect write paths by sb_start_write - sb_end_write

There are several entry points which dirty pages in a filesystem.  mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().

->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofs: Skip atime update on frozen filesystem
Jan Kara [Tue, 12 Jun 2012 14:20:36 +0000 (16:20 +0200)]
fs: Skip atime update on frozen filesystem

It is unexpected to block reading of frozen filesystem because of atime update.
Also handling blocking on frozen filesystem because of atime update would make
locking more complex than it already is. So just skip atime update when
filesystem is frozen like we skip it when filesystem is remounted read-only.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofs: Add freezing handling to mnt_want_write() / mnt_drop_write()
Jan Kara [Tue, 12 Jun 2012 14:20:35 +0000 (16:20 +0200)]
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()

Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agofs: Improve filesystem freezing handling
Jan Kara [Tue, 12 Jun 2012 14:20:34 +0000 (16:20 +0200)]
fs: Improve filesystem freezing handling

vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one of
internal filesystem purposes (used e.g. to track running transactions).  Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.

Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.

[AV: microoptimize sb_start_write(); we want it fast in normal case]

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agoswitch the protection of percpu_counter list to spinlock
Al Viro [Tue, 31 Jul 2012 05:28:31 +0000 (09:28 +0400)]
switch the protection of percpu_counter list to spinlock

... making percpu_counter_destroy() non-blocking

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 years agopowerpc/kvm/book3s_32: Fix MTMSR_EERI macro
Alexander Graf [Mon, 30 Jul 2012 12:32:59 +0000 (12:32 +0000)]
powerpc/kvm/book3s_32: Fix MTMSR_EERI macro

Commit b38c77d82e4 moved the MTMSR_EERI macro from the KVM code to generic
ppc_asm.h code. However, while adding it in the headers for the ppc32 case,
it missed out to remove the former definition in the KVM code.

This patch fixes compilation on server type PPC32 targets with CONFIG_KVM
enabled.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
11 years agoMerge remote-tracking branch 'kumar/merge' into merge
Benjamin Herrenschmidt [Tue, 31 Jul 2012 05:18:31 +0000 (15:18 +1000)]
Merge remote-tracking branch 'kumar/merge' into merge

Kumar says:

"A few patches that missed the initial 3.6 window.  These are bug fixes at
this point."

11 years agoMerge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 31 Jul 2012 05:14:04 +0000 (22:14 -0700)]
Merge tag 'writeback-proportions' of git://git./linux/kernel/git/wfg/linux

Pull writeback updates from Wu Fengguang:
 "Use time based periods to age the writeback proportions, which can
  adapt equally well to fast/slow devices."

Fix up trivial conflict in comment in fs/sync.c

* tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Fix some comment errors
  block: Convert BDI proportion calculations to flexible proportions
  lib: Fix possible deadlock in flexible proportion code
  lib: Proportions with flexible period

11 years ago[media] tlg2300: Declare MODULE_FIRMWARE usage
Tim Gardner [Wed, 25 Jul 2012 18:41:04 +0000 (15:41 -0300)]
[media] tlg2300: Declare MODULE_FIRMWARE usage

Cc: Huang Shijie <shijie8@gmail.com>
Cc: Kang Yong <kangyong@telegent.com>
Cc: Zhang Xiaobing <xbzhang@telegent.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: linux-media@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years ago[media] lgs8gxx: Declare MODULE_FIRMWARE usage
Tim Gardner [Wed, 25 Jul 2012 15:54:02 +0000 (12:54 -0300)]
[media] lgs8gxx: Declare MODULE_FIRMWARE usage

Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: linux-media@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years ago[media] xc5000: Add MODULE_FIRMWARE statements
Tim Gardner [Wed, 25 Jul 2012 12:15:19 +0000 (09:15 -0300)]
[media] xc5000: Add MODULE_FIRMWARE statements

This will make modinfo more useful with regard
to discovering necessary firmware files.

Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michael Krufky <mkrufky@kernellabs.com>
Cc: Eddi De Pieri <eddi@depieri.net>
Cc: linux-media@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years ago[media] s2255drv: Add MODULE_FIRMWARE statement
Tim Gardner [Tue, 24 Jul 2012 19:52:27 +0000 (16:52 -0300)]
[media] s2255drv: Add MODULE_FIRMWARE statement

Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Dean Anderson <linux-dev@sensoray.com>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: linux-media@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years agoMerge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Linus Torvalds [Tue, 31 Jul 2012 02:16:57 +0000 (19:16 -0700)]
Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Features include:
   - More preparatory patches for modularising NFSv2/v3/v4.  Split out
     the various NFSv2/v3/v4-specific code into separate files
   - More preparation for the NFSv4 migration code
   - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold
     parameters
   - pNFS fast failover when the data servers are down
   - Various cleanups and debugging patches"

* tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
  nfs: fix fl_type tests in NFSv4 code
  NFS: fix pnfs regression with directio writes
  NFS: fix pnfs regression with directio reads
  sunrpc: clnt: Add missing braces
  nfs: fix stub return type warnings
  NFS: exit_nfs_v4() shouldn't be an __exit function
  SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors
  NFS: Split out NFS v4 client functions
  NFS: Split out the NFS v4 filesystem types
  NFS: Create a single nfs_clone_super() function
  NFS: Split out NFS v4 server creating code
  NFS: Initialize the NFS v4 client from init_nfs_v4()
  NFS: Move the v4 getroot code to nfs4getroot.c
  NFS: Split out NFS v4 file operations
  NFS: Initialize v4 sysctls from nfs_init_v4()
  NFS: Create an init_nfs_v4() function
  NFS: Split out NFS v4 inode operations
  NFS: Split out NFS v3 inode operations
  NFS: Split out NFS v2 inode operations
  NFS: Clean up nfs4_proc_setclientid() and friends
  ...

11 years ago[media] dib8000: move dereference after check for NULL
Dan Carpenter [Fri, 20 Jul 2012 10:11:57 +0000 (07:11 -0300)]
[media] dib8000: move dereference after check for NULL

My static checker complains that we dereference "state" inside the call
to fft_to_mode() before checking for NULL.  The comments say that it is
possible for "state" to be NULL so I have moved the dereference after
the check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years ago[media] Documentation: Update cardlists
Mauro Carvalho Chehab [Tue, 31 Jul 2012 02:10:38 +0000 (23:10 -0300)]
[media] Documentation: Update cardlists

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years ago[media] bttv: add support for Aposonic W-DVR
Tony Gentile [Thu, 19 Jul 2012 12:36:33 +0000 (09:36 -0300)]
[media] bttv: add support for Aposonic W-DVR

Forwarded-by: Gerd Hoffmann <kraxel@bytesex.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
11 years agoMerge tag 'mfd-for-linus-3.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 31 Jul 2012 02:06:25 +0000 (19:06 -0700)]
Merge tag 'mfd-for-linus-3.6-1' of git://git./linux/kernel/git/sameo/mfd-2.6

Pull MFD fix from Samuel Ortiz:
 "This one fixes an s5m8767 regulator build breakage due to a merge
  conflict caused by the MFD s5m API changes."

* tag 'mfd-for-linus-3.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
  regulator: Fix an s5m8767 build failure

11 years ago[media] cx25821: Remove bad strcpy to read-only char*
Ezequiel García [Wed, 18 Jul 2012 16:41:11 +0000 (13:41 -0300)]
[media] cx25821: Remove bad strcpy to read-only char*

The strcpy was being used to set the name of the board.
This was both wrong and redundant,
since the destination char* was read-only and
the name is set statically at compile time.

The type of the name field is changed to const char*
to prevent future errors.

Reported-by: Radek Masin <radek@masin.eu>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>