David Rientjes [Fri, 13 Jan 2012 01:18:52 +0000 (17:18 -0800)]
oom, memcg: fix exclusion of memcg threads after they have detached their mm
The oom killer relies on logic that identifies threads that have already
been oom killed when scanning the tasklist and, if found, deferring
until such threads have exited. This is done by checking for any
candidate threads that have the TIF_MEMDIE bit set.
For memcg ooms, candidate threads are first found by calling
task_in_mem_cgroup() since the oom killer should not defer if there's an
oom killed thread in another memcg.
Unfortunately, task_in_mem_cgroup() excludes threads if they have
detached their mm in the process of exiting so TIF_MEMDIE is never
detected for such conditions. This is different for global, mempolicy,
and cpuset oom conditions where a detached mm is only excluded after
checking for TIF_MEMDIE and deferring, if necessary, in
select_bad_process().
The fix is to return true if a task has a detached mm but is still in
the memcg or its hierarchy that is currently oom. This will allow the
oom killer to appropriately defer rather than kill unnecessarily or, in
the worst case, panic the machine if nothing else is available to kill.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 13 Jan 2012 01:18:50 +0000 (17:18 -0800)]
memcg: free entries in soft_limit_tree if allocation fails
If we are not able to allocate tree nodes for all NUMA nodes then we
should release those that were allocated.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bob Liu [Fri, 13 Jan 2012 01:18:48 +0000 (17:18 -0800)]
page_cgroup: add helper function to get swap_cgroup
There are multiple places which need to get the swap_cgroup address, so
add a helper function:
static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
struct swap_cgroup_ctrl **ctrl);
to simplify the code.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:45 +0000 (17:18 -0800)]
mm: memcg: remove unneeded checks from uncharge_page()
mem_cgroup_uncharge_page() is only called on either freshly allocated
pages without page->mapping or on rmapped PageAnon() pages. There is no
need to check for a page->mapping that is not an anon_vma.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:43 +0000 (17:18 -0800)]
mm: memcg: remove unneeded checks from newpage_charge()
All callsites pass in freshly allocated pages and a valid mm. As a
result, all checks pertaining to the page's mapcount, page->mapping or the
fallback to init_mm are unneeded.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:40 +0000 (17:18 -0800)]
mm: page_cgroup: check page_cgroup arrays in lookup_page_cgroup() only when necessary
lookup_page_cgroup() is usually used only against pages that are used in
userspace.
The exception is the CONFIG_DEBUG_VM-only memcg check from the page
allocator: it can run on pages without page_cgroup descriptors allocated
when the pages are fed into the page allocator for the first time during
boot or memory hotplug.
Include the array check only when CONFIG_DEBUG_VM is set and save the
unnecessary check in production kernels.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:38 +0000 (17:18 -0800)]
mm: memcg: lookup_page_cgroup (almost) never returns NULL
Pages have their corresponding page_cgroup descriptors set up before
they are used in userspace, and thus managed by a memory cgroup.
The only time where lookup_page_cgroup() can return NULL is in the
CONFIG_DEBUG_VM-only page sanity checking code that executes while
feeding pages into the page allocator for the first time.
Remove the NULL checks against lookup_page_cgroup() results from all
callsites where we know that corresponding page_cgroup descriptors must
be allocated, and add a comment to the callsite that actually does have
to check the return value.
[hughd@google.com: stop oops in mem_cgroup_update_page_stat()]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:35 +0000 (17:18 -0800)]
mm: memcg: clean up fault accounting
The fault accounting functions have a single, memcg-internal user, so they
don't need to be global. In fact, their one-line bodies can be directly
folded into the caller. And since faults happen one at a time, use
this_cpu_inc() directly instead of this_cpu_add(foo, 1).
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:32 +0000 (17:18 -0800)]
mm: unify remaining mem_cont, mem, etc. variable names to memcg
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:29 +0000 (17:18 -0800)]
mm: oom_kill: remove memcg argument from oom_kill_task()
The memcg argument of oom_kill_task() hasn't been used since 341aea2
'oom-kill: remove boost_dying_task_prio()'. Kill it.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ying Han [Fri, 13 Jan 2012 01:18:27 +0000 (17:18 -0800)]
memcg: fix pgpgin/pgpgout documentation
The two memcg stats pgpgin/pgpgout have different meaning than the ones
in vmstat, which indicates that we picked a bad naming for them.
It might be late to change the stat name, but better documentation is
always helpful.
Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhu Yanhai [Fri, 13 Jan 2012 01:18:24 +0000 (17:18 -0800)]
Documentation/cgroups/memory.txt: fix typo
It should be memsw.max_usage_in_bytes. This typo has been there for
a really long time.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:23 +0000 (17:18 -0800)]
mm: memcg: shorten preempt-disabled section around event checks
Only the ratelimit checks themselves have to run with preemption
disabled, the resulting actions - checking for usage thresholds,
updating the soft limit tree - can and should run with preemption
enabled.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reported-by: Yong Zhang <yong.zhang0@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>
Reported-by: Luis Henriques <henrix@camandro.org>
Tested-by: Luis Henriques <henrix@camandro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Fri, 13 Jan 2012 01:18:20 +0000 (17:18 -0800)]
memcg: make mem_cgroup_split_huge_fixup() more efficient
In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
page_cgroup modifcations. It takes move_lock_page_cgroup() and modifies
page_cgroup and LRU accounting jobs and called HPAGE_PMD_SIZE - 1 times.
But thinking again,
- compound_lock() is held at move_accout...then, it's not necessary
to take move_lock_page_cgroup().
- LRU is locked and all tail pages will go into the same LRU as
head is now on.
- page_cgroup is contiguous in huge page range.
This patch fixes mem_cgroup_split_huge_fixup() as to be called once per
hugepage and reduce costs for spliting.
[akpm@linux-foundation.org: fix typo, per Michal]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:18 +0000 (17:18 -0800)]
mm: memcg: remove unused node/section info from pc->flags
To find the page corresponding to a certain page_cgroup, the pc->flags
encoded the node or section ID with the base array to compare the pc
pointer to.
Now that the per-memory cgroup LRU lists link page descriptors directly,
there is no longer any code that knows the struct page_cgroup of a PFN
but not the struct page.
[hughd@google.com: remove unused node/section info from pc->flags fix]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:15 +0000 (17:18 -0800)]
mm: make per-memcg LRU lists exclusive
Now that all code that operated on global per-zone LRU lists is
converted to operate on per-memory cgroup LRU lists instead, there is no
reason to keep the double-LRU scheme around any longer.
The pc->lru member is removed and page->lru is linked directly to the
per-memory cgroup LRU lists, which removes two pointers from a
descriptor that exists for every page frame in the system.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:10 +0000 (17:18 -0800)]
mm: collect LRU list heads into struct lruvec
Having a unified structure with a LRU list set for both global zones and
per-memcg zones allows to keep that code simple which deals with LRU
lists and does not care about the container itself.
Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:06 +0000 (17:18 -0800)]
mm: vmscan: convert global reclaim to per-memcg LRU lists
The global per-zone LRU lists are about to go away on memcg-enabled
kernels, global reclaim must be able to find its pages on the per-memcg
LRU lists.
Since the LRU pages of a zone are distributed over all existing memory
cgroups, a scan target for a zone is complete when all memory cgroups
are scanned for their proportional share of a zone's memory.
The forced scanning of small scan targets from kswapd is limited to
zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
force-scanning the LRU lists of multiple memory cgroups.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:18:02 +0000 (17:18 -0800)]
mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
root_mem_cgroup, lacking a configurable limit, was never subject to
limit reclaim, so the pages charged to it could be kept off its LRU
lists. They would be found on the global per-zone LRU lists upon
physical memory pressure and it made sense to avoid uselessly linking
them to both lists.
The global per-zone LRU lists are about to go away on memcg-enabled
kernels, with all pages being exclusively linked to their respective
per-memcg LRU lists. As a result, pages of the root_mem_cgroup must
also be linked to its LRU lists again. This is purely about the LRU
list, root_mem_cgroup is still not charged.
The overhead is temporary until the double-LRU scheme is going away
completely.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:17:59 +0000 (17:17 -0800)]
mm: move memcg hierarchy reclaim to generic reclaim code
Memory cgroup limit reclaim and traditional global pressure reclaim will
soon share the same code to reclaim from a hierarchical tree of memory
cgroups.
In preparation of this, move the two right next to each other in
shrink_zone().
The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
limit reclaim function, which still does hierarchy walking on its own,
and a limit (shrinking) reclaim function, which relies on generic
reclaim code to walk the hierarchy.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:17:55 +0000 (17:17 -0800)]
mm: memcg: per-priority per-zone hierarchy scan generations
Memory cgroup limit reclaim currently picks one memory cgroup out of the
target hierarchy, remembers it as the last scanned child, and reclaims
all zones in it with decreasing priority levels.
The new hierarchy reclaim code will pick memory cgroups from the same
hierarchy concurrently from different zones and priority levels, it
becomes necessary that hierarchy roots not only remember the last
scanned child, but do so for each zone and priority level.
Until now, we reclaimed memcgs like this:
mem = mem_cgroup_iter(root)
for each priority level:
for each zone in zonelist:
reclaim(mem, zone)
But subsequent patches will move the memcg iteration inside the loop
over the zones:
for each priority level:
for each zone in zonelist:
mem = mem_cgroup_iter(root)
reclaim(mem, zone)
And to keep with the original scan order - memcg -> priority -> zone -
the last scanned memcg has to be remembered per zone and per priority
level.
Furthermore, global reclaim will be switched to the hierarchy walk as
well. Different from limit reclaim, which can just recheck the limit
after some reclaim progress, its target is to scan all memcgs for the
desired zone pages, proportional to the memcg size, and so reliably
detecting a full hierarchy round-trip will become crucial.
Currently, the code relies on one reclaimer encountering the same memcg
twice, but that is error-prone with concurrent reclaimers. Instead, use
a generation counter that is increased every time the child with the
highest ID has been visited, so that reclaimers can stop when the
generation changes.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:17:52 +0000 (17:17 -0800)]
mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
Memory cgroup hierarchies are currently handled completely outside of
the traditional reclaim code, which is invoked with a single memory
cgroup as an argument for the whole call stack.
Subsequent patches will switch this code to do hierarchical reclaim, so
there needs to be a distinction between a) the memory cgroup that is
triggering reclaim due to hitting its limit and b) the memory cgroup
that is being scanned as a child of a).
This patch introduces a struct mem_cgroup_zone that contains the
combination of the memory cgroup and the zone being scanned, which is
then passed down the stack instead of the zone argument.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:17:50 +0000 (17:17 -0800)]
mm: vmscan: distinguish global reclaim from global LRU scanning
The traditional zone reclaim code is scanning the per-zone LRU lists
during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
lists when reclaiming on behalf of a memory cgroup limit.
Subsequent patches will convert the traditional reclaim code to reclaim
exclusively from the per-memory cgroup LRU lists. As a result, using
the predicate for which LRU list is scanned will no longer be
appropriate to tell global reclaim from limit reclaim.
This patch adds a global_reclaim() predicate to tell direct/kswapd
reclaim from memory cgroup limit reclaim and substitutes it in all
places where currently scanning_global_lru() is used for that.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 13 Jan 2012 01:17:48 +0000 (17:17 -0800)]
mm: memcg: consolidate hierarchy iteration primitives
The memcg naturalization series:
Memory control groups are currently bolted onto the side of
traditional memory management in places where better integration would
be preferrable. To reclaim memory, for example, memory control groups
maintain their own LRU list and reclaim strategy aside from the global
per-zone LRU list reclaim. But an extra list head for each existing
page frame is expensive and maintaining it requires additional code.
This patchset disables the global per-zone LRU lists on memory cgroup
configurations and converts all its users to operate on the per-memory
cgroup lists instead. As LRU pages are then exclusively on one list,
this saves two list pointers for each page frame in the system:
page_cgroup array size with 4G physical memory
vanilla: allocated
31457280 bytes of page_cgroup
patched: allocated
15728640 bytes of page_cgroup
At the same time, system performance for various workloads is
unaffected:
100G sparse file cat, 4G physical memory, 10 runs, to test for code
bloat in the traditional LRU handling and kswapd & direct reclaim
paths, without/with the memory controller configured in
vanilla: 71.603(0.207) seconds
patched: 71.640(0.156) seconds
vanilla: 79.558(0.288) seconds
patched: 77.233(0.147) seconds
100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
bloat in the traditional memory cgroup LRU handling and reclaim path
vanilla: 96.844(0.281) seconds
patched: 94.454(0.311) seconds
4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
swap on SSD, 10 runs, to test for regressions in kswapd & direct
reclaim using per-memcg LRU lists with multiple memcgs and multiple
allocators within each memcg
vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
swap on SSD, 10 runs, to test for regressions in hierarchical memcg
setups
vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
This patch:
There are currently two different implementations of iterating over a
memory cgroup hierarchy tree.
Consolidate them into one worker function and base the convenience
looping-macros on top of it.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Fri, 13 Jan 2012 01:17:44 +0000 (17:17 -0800)]
memcg: add mem_cgroup_replace_page_cache() to fix LRU issue
Commit
ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page(). This function replaces a page in the
radix-tree with a new page. WHen doing this, memory cgroup needs to fix
up the accounting information. memcg need to check PCG_USED bit etc.
In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache(). So, memcg's LRU accounting information should be
fixed, too.
This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.
Background:
replace_page_cache_page() is called by FUSE code in its splice() handling.
Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
because rmdir() checks the whole LRU is empty and there is no account leak.
If a page is on the other LRU than it should be, rmdir() will fail.
This bug was added in March 2011, but no bug report yet. I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.
The result of this bug is that admin cannot destroy a memcg because of
account leak. So, no panic, no deadlock. And, even if an active cgroup
exist, umount can succseed. So no problem at shutdown.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jason Baron [Fri, 13 Jan 2012 01:17:43 +0000 (17:17 -0800)]
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 mintues (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implemetation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
In terms of the path limit selection, I think its first worth noting that
the most common case for epoll, is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'. In this case, each 'source file descriptor' has a 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not prevent any workloads that currently work to
fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently its only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also, worth noting is that the epmuex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths, stricly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
testing using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expectded.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sasha Levin [Fri, 13 Jan 2012 01:17:40 +0000 (17:17 -0800)]
pipe: fail cleanly when root tries F_SETPIPE_SZ with big size
When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
Call Trace:
warn_slowpath_common+0x75/0xb0
warn_slowpath_null+0x15/0x20
__alloc_pages_nodemask+0x5d3/0x7a0
__get_free_pages+0x12/0x50
__kmalloc+0x12b/0x150
pipe_set_size+0x75/0x120
pipe_fcntl+0xf8/0x140
do_fcntl+0x2d4/0x410
sys_fcntl+0x66/0xa0
system_call_fastpath+0x16/0x1b
---[ end trace
432f702e6db7b5ee ]---
Instead, make kcalloc() handle the overflow case and fail quietly.
[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stanislaw Gruszka [Fri, 13 Jan 2012 01:17:39 +0000 (17:17 -0800)]
slub: document setting min order with debug_guardpage_minorder > 0
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathias Krause [Fri, 13 Jan 2012 01:17:38 +0000 (17:17 -0800)]
parisc, exec: remove redundant set_fs(USER_DS)
The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathias Krause [Fri, 13 Jan 2012 01:17:36 +0000 (17:17 -0800)]
ia64, exec: remove redundant set_fs(USER_DS)
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 13 Jan 2012 01:17:34 +0000 (17:17 -0800)]
drivers/video/nvidia/nvidia.c: fix warning
Fix the int/bool confusion in there.
drivers/video/nvidia/nvidia.c:1602: warning: return from incompatible pointer type
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiko Carstens [Fri, 13 Jan 2012 01:17:33 +0000 (17:17 -0800)]
mm,x86,um: move CMPXCHG_DOUBLE config option
Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
can simply select the option if it is supported.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiko Carstens [Fri, 13 Jan 2012 01:17:30 +0000 (17:17 -0800)]
mm,x86,um: move CMPXCHG_LOCAL config option
Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
can simply select the option if it is supported.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiko Carstens [Fri, 13 Jan 2012 01:17:27 +0000 (17:17 -0800)]
mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL
While implementing cmpxchg_double() on s390 I realized that we don't set
CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
However setting that option will increase the size of struct page by
eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
make sense that a present cpu feature should increase the size of struct
page.
Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
that it should depend on CMPXCHG_DOUBLE instead.
This patch:
If an architecture supports CMPXCHG_LOCAL this shouldn't result
automatically in larger struct pages if the SLUB allocator is used.
Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
can be selected if a double word aligned struct page is required. Also
update x86 Kconfig so that it should work as before.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 13 Jan 2012 01:17:25 +0000 (17:17 -0800)]
include/linux/linkage.h: remove unused ATTRIB_NORET macro
The uses have been renamed so delete the unused macro.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 13 Jan 2012 01:17:21 +0000 (17:17 -0800)]
treewide: convert uses of ATTRIB_NORETURN to __noreturn
Use the more commonly used __noreturn instead of ATTRIB_NORETURN.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 13 Jan 2012 01:17:17 +0000 (17:17 -0800)]
treewide: remove useless NORET_TYPE macro and uses
It's a very old and now unused prototype marking so just delete it.
Neaten panic pointer argument style to keep checkpatch quiet.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 13 Jan 2012 01:17:14 +0000 (17:17 -0800)]
include/linux/linkage.h: remove unused NORET_AND macro
The only use in kernel.h is gone so remove the macro.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 13 Jan 2012 01:17:13 +0000 (17:17 -0800)]
kernel.h: neaten panic prototype
Use __printf macro.
Convert NORET_AND to ATTRIB_NORET.
Use the normal kernel style for pointer arguments.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Boyd [Fri, 13 Jan 2012 01:17:11 +0000 (17:17 -0800)]
kprobes: silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
In file included from arch/x86/include/asm/uaccess.h:573,
from kernel/kprobes.c:55:
In function 'copy_from_user',
inlined from 'write_enabled_file_bool' at
kernel/kprobes.c:2191:
arch/x86/include/asm/uaccess_64.h:65:
warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaotian Feng [Fri, 13 Jan 2012 01:17:08 +0000 (17:17 -0800)]
proc: fix null pointer deref in proc_pid_permission()
get_proc_task() can fail to search the task and return NULL,
put_task_struct() will then bomb the kernel with following oops:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
IP: [<
ffffffff81217d34>] proc_pid_permission+0x64/0xe0
PGD
112075067 PUD
112814067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
This is a regression introduced by commit
0499680a ("procfs: add hidepid=
and gid= mount options"). The kernel should return -ESRCH if
get_proc_task() failed.
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Stephen Wilson <wilsons@start.ca>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anton Vorontsov [Wed, 11 Jan 2012 01:11:46 +0000 (05:11 +0400)]
x86: Get rid of 'dubious one-bit signed bitfield' sprase warning
This very noisy sparse warning appears on almost every file in the
kernel:
CHECK init/main.c
arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield
This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
type and thus fixes the warning.
Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 12 Jan 2012 16:00:30 +0000 (08:00 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
ALSA: usb-audio: add Yamaha MOX6/MOX8 support
ALSA: virtuoso: add S/PDIF input support for all Xonars
ALSA: ice1724 - Support for ooAoo SQ210a
ALSA: ice1724 - Allow card info based on model only
ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
ALSA: hdspm - Provide unique driver id based on card serial
ASoC: Dynamically allocate the rtd device for a non-empty release()
ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
ALSA: hda/cirrus - support for iMac12,2 model
ASoC: cx20442: add bias control over a platform provided regulator
ALSA: usb-audio - Avoid flood of frame-active debug messages
ALSA: snd-usb-us122l: Delete calls to preempt_disable
mfd: Put WM8994 into cache only mode when suspending
...
Fix up trivial conflicts in:
- arch/arm/mach-s3c64xx/mach-crag6410.c:
renamed speyside_wm8962 to tobermory, added littlemill right
next to it
- drivers/base/regmap/{regcache.c,regmap.c}:
duplicate diff that had already come in with other changes in
the regmap tree
Bjorn Helgaas [Thu, 12 Jan 2012 15:01:40 +0000 (08:01 -0700)]
x86/PCI: build amd_bus.o only when CONFIG_AMD_NB=y
We only need amd_bus.o for AMD systems with PCI. arch/x86/pci/Makefile
already depends on CONFIG_PCI=y, so this patch just adds the dependency
on CONFIG_AMD_NB.
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org # 2.6.34+ (needs adjustment for k8 -> amd rename)
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Takashi Iwai [Thu, 12 Jan 2012 08:59:18 +0000 (09:59 +0100)]
Merge branch 'topic/hda' into for-linus
Takashi Iwai [Thu, 12 Jan 2012 08:59:14 +0000 (09:59 +0100)]
Merge branch 'topic/misc' into for-linus
Takashi Iwai [Thu, 12 Jan 2012 08:48:20 +0000 (09:48 +0100)]
Merge branch 'for-3.3' of git://git./linux/kernel/git/broonie/sound into topic/asoc
Linus Torvalds [Thu, 12 Jan 2012 07:29:20 +0000 (23:29 -0800)]
Merge tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh
SH/R-Mobile updates for 3.3 merge window.
* tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh: (32 commits)
arm: mach-shmobile: add a resource name for shdma
ARM: mach-shmobile: r8a7779 SMP support V3
ARM: mach-shmobile: Add kota2 defconfig.
ARM: mach-shmobile: Add marzen defconfig.
ARM: mach-shmobile: r8a7779 power domain support V2
ARM: mach-shmobile: Fix up marzen build for recent GIC changes.
ARM: mach-shmobile: r8a7779 PFC function support
ARM: mach-shmobile: Flush caches in platform_cpu_die()
ARM: mach-shmobile: Allow SoC specific CPU kill code
ARM: mach-shmobile: Fix headsmp.S code to use CPUINIT
ARM: mach-shmobile: clock-r8a7779: clkz/clkzs support
ARM: mach-shmobile: clock-r8a7779: add DIV4 clock support
ARM: mach-shmobile: Marzen LAN89218 support
ARM: mach-shmobile: Marzen SCIF2/SCIF4 support
ARM: mach-shmobile: r8a7779 PFC GPIO-only support V2
ARM: mach-shmobile: r8a7779 and Marzen base support V2
sh: pfc: Unlock register support
sh: pfc: Variable bitfield width config register support
sh: pfc: Add config_reg_helper() function
sh: pfc: Convert index to field and value pair
...
Linus Torvalds [Thu, 12 Jan 2012 07:22:52 +0000 (23:22 -0800)]
Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh
SuperH updates for 3.3 merge window.
* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (38 commits)
sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
sh: mach-rsk: Update for parse_mtd_partitions() fallout.
sh: sh2a: Improve cache flush/invalidate functions
sh: also without PM_RUNTIME pm_runtime.o must be built
sh: add a resource name for shdma
sh: Remove redundant try_to_freeze() invocations.
sh: Ensure IRQs are enabled across do_notify_resume().
sh: Fix up store queue code for subsys_interface changes.
sh: clkfwk: sh_clk_init_parent() should be called after clk_register()
sh: add platform_device for renesas_usbhs in board-sh7757lcr
sh: modify clock-sh7757 for renesas_usbhs
sh: pfc: ioremap() support
sh: use ioread32/iowrite32 and mapped_reg for div6
sh: use ioread32/iowrite32 and mapped_reg for div4
sh: use ioread32/iowrite32 and mapped_reg for mstp32
sh: extend clock struct with mapped_reg member
sh: clkfwk: clock-sh73a0: all div6_clks use SH_CLK_DIV6_EXT()
sh: clkfwk: clock-sh7724: all div6_clks use SH_CLK_DIV6_EXT()
sh: clock-sh7723: add CLKDEV_ICK_ID for cleanup
serial: sh-sci: Handle GPIO function requests.
...
Linus Torvalds [Thu, 12 Jan 2012 06:52:48 +0000 (22:52 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix lockup by limiting load-balance retries on lock-break
sched: Fix CONFIG_CGROUP_SCHED dependency
sched: Remove empty #ifdefs
Paul Mundt [Thu, 12 Jan 2012 04:49:05 +0000 (13:49 +0900)]
sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
Follows the RSK+ change for the same rationale.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Thu, 12 Jan 2012 04:47:42 +0000 (13:47 +0900)]
sh: mach-rsk: Update for parse_mtd_partitions() fallout.
The RSK+ setup code was doing some pretty dubious things with
parse_mtd_partitions() in order to populate the physmap-flash map
platform data. The physmap-flash driver contains all of the functionality
that we require already, so simply drop the special casing and pad out
the platform data accordingly.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Thu, 12 Jan 2012 04:11:43 +0000 (13:11 +0900)]
Merge branch 'sh/nommu' into sh-latest
Phil Edworthy [Mon, 9 Jan 2012 16:08:47 +0000 (16:08 +0000)]
sh: sh2a: Improve cache flush/invalidate functions
The cache functions lock out interrupts for long periods; this patch
reduces the impact when operating on large address ranges. In such
cases it will:
- Invalidate the entire cache rather than individual addresses.
- Do nothing when flushing the operand cache in write-through mode.
- When flushing the operand cache in write-back mdoe, index the
search for matching addresses on the cache entires instead of the
addresses to flush
Note: sh2a__flush_purge_region was only invalidating the operand
cache, this adds flush.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Thu, 12 Jan 2012 03:57:41 +0000 (12:57 +0900)]
Merge branch 'sh/hwblk' into sh-latest
Paul Mundt [Thu, 12 Jan 2012 03:57:32 +0000 (12:57 +0900)]
Merge branch 'sh/pm-runtime' into sh-latest
Conflicts:
arch/sh/kernel/cpu/sh4a/clock-sh7723.c
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Guennadi Liakhovetski [Tue, 10 Jan 2012 15:04:11 +0000 (16:04 +0100)]
sh: also without PM_RUNTIME pm_runtime.o must be built
When CONFIG_PM_RUNTIME is off, drivers/sh/pm_runtime.o still has to be
built on sh platforms, because then it provides means to statically
switch on device PM clocks.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Shimoda, Yoshihiro [Tue, 10 Jan 2012 05:20:58 +0000 (14:20 +0900)]
sh: add a resource name for shdma
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Paul Mundt [Thu, 12 Jan 2012 03:20:18 +0000 (12:20 +0900)]
Merge branch 'rmobile/smp' into rmobile-latest
Shimoda, Yoshihiro [Tue, 10 Jan 2012 05:21:31 +0000 (14:21 +0900)]
arm: mach-shmobile: add a resource name for shdma
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Linus Torvalds [Thu, 12 Jan 2012 03:13:40 +0000 (19:13 -0800)]
Merge branch 'x86-platform-for-linus' of git://git./linux/kernel/git/tip/tip
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel config: Fix the APB_TIMER selection
x86/mrst: Add additional debug prints for pb_keys
x86/intel config: Revamp configuration to allow for Moorestown and Medfield
x86/intel/scu/ipc: Match the changes in the x86 configuration
x86/apb: Fix configuration constraints
x86: Fix INTEL_MID silly
x86/Kconfig: Cyclone-timer depends on x86-summit
x86: Reduce clock calibration time during slave cpu startup
x86/config: Revamp configuration for MID devices
x86/sfi: Kill the IRQ as id hack
Linus Torvalds [Thu, 12 Jan 2012 03:13:04 +0000 (19:13 -0800)]
Merge branch 'x86-debug-for-linus' of git://git./linux/kernel/git/tip/tip
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, reboot: Fix typo in nmi reboot path
x86, NMI: Add to_cpumask() to silence compile warning
x86, NMI: NMI selftest depends on the local apic
x86: Add stack top margin for stack overflow checking
x86, NMI: NMI-selftest should handle the UP case properly
x86: Fix the 32-bit stackoverflow-debug build
x86, NMI: Add knob to disable using NMI IPIs to stop cpus
x86, NMI: Add NMI IPI selftest
x86, reboot: Use NMI instead of REBOOT_VECTOR to stop cpus
x86: Clean up the range of stack overflow checking
x86: Panic on detection of stack overflow
x86: Check stack overflow in detail
Linus Torvalds [Thu, 12 Jan 2012 03:12:33 +0000 (19:12 -0800)]
Merge branch 'x86-efi-for-linus' of git://git./linux/kernel/git/tip/tip
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Break up large initrd reads
x86, efi: EFI boot stub support
efi: Add EFI file I/O data types
efi.h: Add boottime->locate_handle search types
efi.h: Add graphics protocol guids
efi.h: Add allocation types for boottime->allocate_pages()
efi.h: Add efi_image_loaded_t
efi.h: Add struct definition for boot time services
x86: Don't use magic strings for EFI loader signature
x86: Add missing bzImage fields to struct setup_header
Linus Torvalds [Thu, 12 Jan 2012 03:12:10 +0000 (19:12 -0800)]
Merge branch 'x86-mm-for-linus' of git://git./linux/kernel/git/tip/tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/numa: Add constraints check for nid parameters
mm, x86: Remove debug_pagealloc_enabled
x86/mm: Initialize high mem before free_all_bootmem()
arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
x86: Fix mmap random address range
x86, mm: Unify zone_sizes_init()
x86, mm: Prepare zone_sizes_init() for unification
x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
x86, mm: Use max_pfn instead of highend_pfn
x86, mm: Move zone init from paging_init() on 64-bit
x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit
Linus Torvalds [Thu, 12 Jan 2012 02:53:33 +0000 (18:53 -0800)]
Merge branch 'next' of git://git./linux/kernel/git/davej/cpufreq
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: (23 commits)
[CPUFREQ] EXYNOS: Removed useless headers and codes
[CPUFREQ] EXYNOS: Make EXYNOS common cpufreq driver
[CPUFREQ] powernow-k8: Update copyright, maintainer and documentation information
[CPUFREQ] powernow-k8: Fix indexing issue
[CPUFREQ] powernow-k8: Avoid Pstate MSR accesses on systems supporting CPB
[CPUFREQ] update lpj only if frequency has changed
[CPUFREQ] cpufreq:userspace: fix cpu_cur_freq updation
[CPUFREQ] Remove wall variable from cpufreq_gov_dbs_init()
[CPUFREQ] EXYNOS4210: cpufreq code is changed for stable working
[CPUFREQ] EXYNOS4210: Update frequency table for cpu divider
[CPUFREQ] EXYNOS4210: Remove code about bus on cpufreq
[CPUFREQ] s3c64xx: Use pr_fmt() for consistent log messages
cpufreq: OMAP: fixup for omap_device changes, include <linux/module.h>
cpufreq: OMAP: fix freq_table leak
cpufreq: OMAP: put clk if cpu_init failed
cpufreq: OMAP: only supports OPP library
cpufreq: OMAP: dont support !freq_table
cpufreq: OMAP: deny initialization if no mpudev
cpufreq: OMAP: move clk name decision to init
cpufreq: OMAP: notify even with bad boot frequency
...
Linus Torvalds [Thu, 12 Jan 2012 02:53:05 +0000 (18:53 -0800)]
Merge git://git.infradead.org/battery-2.6
* git://git.infradead.org/battery-2.6: (68 commits)
power_supply: Mark da9052 driver as broken
power_supply: Drop usage of nowarn variant of sysfs_create_link()
s3c_adc_battery: Average over more than one adc sample
power_supply: Add DA9052 battery driver
isp1704_charger: Fix missing check
jz4740-battery: Fix signedness bug
power_supply: Assume mains power by default
sbs-battery: Fix devicetree match table
ARM: rx51: Add bq27200 i2c board info
sbs-battery: Change power supply name
devicetree-bindings: Propagate bq20z75->sbs rename to dt bindings
devicetree-bindings: Add vendor entry for Smart Battery Systems
sbs-battery: Rename internals to new name
bq20z75: Rename to sbs-battery
wm97xx_battery: Use DEFINE_MUTEX() for work_lock
max8997_charger: Remove duplicate module.h
lp8727_charger: Some minor fixes for the header
lp8727_charger: Add header file
power_supply: Convert drivers/power/* to use module_platform_driver()
power_supply: Add "unknown" in power supply type
...
Linus Torvalds [Thu, 12 Jan 2012 02:52:23 +0000 (18:52 -0800)]
Merge branch 'slab/for-linus' of git://git./linux/kernel/git/penberg/linux
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
slub: disallow changing cpu_partial from userspace for debug caches
slub: add missed accounting
slub: Extract get_freelist from __slab_alloc
slub: Switch per cpu partial page support off for debugging
slub: fix a possible memleak in __slab_alloc()
slub: fix slub_max_order Documentation
slub: add missed accounting
slab: add taint flag outputting to debug paths.
slub: add taint flag outputting to debug paths
slab: introduce slab_max_order kernel parameter
slab: rename slab_break_gfp_order to slab_max_order
Linus Torvalds [Thu, 12 Jan 2012 02:51:55 +0000 (18:51 -0800)]
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Two bugfixes for md.
One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON. Has been tagged for -stable.
The other is minor missing functionality.
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md/raid1: perform bad-block tests for WriteMostly devices too.
md: notify the 'degraded' sysfs attribute on failure.
Linus Torvalds [Thu, 12 Jan 2012 02:50:26 +0000 (18:50 -0800)]
Merge branch 'linux-next' of git://git./linux/kernel/git/jbarnes/pci
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (80 commits)
x86/PCI: Expand the x86_msi_ops to have a restore MSIs.
PCI: Increase resource array mask bit size in pcim_iomap_regions()
PCI: DEVICE_COUNT_RESOURCE should be equal to PCI_NUM_RESOURCES
PCI: pci_ids: add device ids for STA2X11 device (aka ConneXT)
PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that breaks USB
x86/PCI: amd: factor out MMCONFIG discovery
PCI: Enable ATS at the device state restore
PCI: msi: fix imbalanced refcount of msi irq sysfs objects
PCI: kconfig: English typo in pci/pcie/Kconfig
PCI/PM/Runtime: make PCI traces quieter
PCI: remove pci_create_bus()
xtensa/PCI: convert to pci_scan_root_bus() for correct root bus resources
x86/PCI: convert to pci_create_root_bus() and pci_scan_root_bus()
x86/PCI: use pci_scan_bus() instead of pci_scan_bus_parented()
x86/PCI: read Broadcom CNB20LE host bridge info before PCI scan
sparc32, leon/PCI: convert to pci_scan_root_bus() for correct root bus resources
sparc/PCI: convert to pci_create_root_bus()
sh/PCI: convert to pci_scan_root_bus() for correct root bus resources
powerpc/PCI: convert to pci_create_root_bus()
powerpc/PCI: split PHB part out of pcibios_map_io_space()
...
Fix up conflicts in drivers/pci/msi.c and include/linux/pci_regs.h due
to the same patches being applied in other branches.
Magnus Damm [Tue, 10 Jan 2012 08:44:39 +0000 (17:44 +0900)]
ARM: mach-shmobile: r8a7779 SMP support V3
This patch contains r8a7779 SMP support V3 - now including
CPU hotplug offine and online support. The r8a7779 power
domain code is tied together with SMP glue code which allows
us to control the power domains via CPU hotplug.
At this point the kernel boots with the 4 Cortex-A9 cores in
SMP mode and all CPU cores except CPU0 can be hotplugged.
The code in platsmp.c is quite far from pretty, but it is
kept like that intentionally to avoid creating layers of
code that will go away in the near future anyway. The code
needs to be updated when some per-SoC handling code will be
added to the ARM architecture, see the following patch for
more information:
"[RFC PATCH 0/3] Per SoC descriptor"
Signed-off-by: Magnus Damm <damm@opensource.se>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Ben Hutchings [Tue, 10 Jan 2012 03:04:32 +0000 (03:04 +0000)]
cpu: Register a generic CPU device on architectures that currently do not
frv, h8300, m68k, microblaze, openrisc, score, um and xtensa currently
do not register a CPU device. Add the config option GENERIC_CPU_DEVICES
which causes a generic CPU device to be registered for each present CPU,
and make all these architectures select it.
Richard Weinberger <richard@nod.at> covered UML and suggested using
per_cpu.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Hutchings [Tue, 10 Jan 2012 02:59:49 +0000 (02:59 +0000)]
cpu: Do not return errors from cpu_dev_init() which will be ignored
cpu_dev_init() is only called from driver_init(), which does not check
its return value. Therefore make cpu_dev_init() return void.
We must register the CPU subsystem, so panic if this fails.
If sched_create_sysfs_power_savings_entries() fails, the damage is
contained, so ignore this (as before).
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pekka Enberg [Wed, 11 Jan 2012 19:11:29 +0000 (21:11 +0200)]
Merge branch 'slab/urgent' into slab/for-linus
Peter Zijlstra [Wed, 11 Jan 2012 12:11:12 +0000 (13:11 +0100)]
sched: Fix lockup by limiting load-balance retries on lock-break
Eric and David reported dead machines and traced it to commit
a195f004 ("sched: Fix load-balance lock-breaking"), it turns out
there's still a scenario where we can end up re-trying forever.
Since there is no strict forward progress guarantee in the
load-balance iteration we can get stuck re-retrying the same
task-set over and over.
Creating a forward progress guarantee with the existing
structure is somewhat non-trivial, for now simply terminate the
retry loop after a few tries.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: David Ahern <dsahern@gmail.com>
[ logic cleanup as suggested by Eric ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1326297936.2442.157.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Takashi Iwai [Wed, 11 Jan 2012 14:30:53 +0000 (15:30 +0100)]
Merge branch 'for-3.3' of git://git./linux/kernel/git/lrg/asoc into topic/asoc
Liam Girdwood [Wed, 11 Jan 2012 12:43:24 +0000 (12:43 +0000)]
ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
The DL1 PDM interface adds a little gain depending on the output device.
Add a method to retrieve the gain value for machine driver usage.
Signed-off-by: Liam Girdwood <lrg@ti.com>
Takashi Iwai [Wed, 11 Jan 2012 11:34:11 +0000 (12:34 +0100)]
ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
When multiple headphone or other detectable output pins are present,
the power-map has to be updated after resume appropriately, but the
current driver doesn't check all pins but only the first pin (since
it's enough to check it for the mute-behavior). This resulted in the
silent output from the secondary outputs after PM resume.
This patch fixes the problem by checking all pins at (re-)init time.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=740347
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Clemens Ladisch [Mon, 19 Dec 2011 22:09:15 +0000 (23:09 +0100)]
ALSA: usb-audio: add Yamaha MOX6/MOX8 support
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Clemens Ladisch [Tue, 6 Dec 2011 09:07:43 +0000 (10:07 +0100)]
ALSA: virtuoso: add S/PDIF input support for all Xonars
All Xonar cards support S/PDIF input, but the cards without optical or
coaxial plugs have only undocumented pin connectors. Support for the
ST/STX was already added in a previous patch; this adds support for the
D1/DX (JP2), DG (J5), DS (J5), and HDAV Slim (J12).
Many thanks to Zoltan Miklos for testing the DS and DX.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pavel Hofman [Tue, 10 Jan 2012 19:28:48 +0000 (20:28 +0100)]
ALSA: ice1724 - Support for ooAoo SQ210a
This card shares PCI ids with Chaintec AV710. Therefore, it will not be
detected automatically, it can only be activated by the module parameter
model=sq210a.
Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pavel Hofman [Tue, 10 Jan 2012 19:28:47 +0000 (20:28 +0100)]
ALSA: ice1724 - Allow card info based on model only
When two different cards share the same PCI vendor/subvendor
identification, allow card info based on model only.
Do not require subvendor ID.
Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pavel Hofman [Tue, 10 Jan 2012 19:45:28 +0000 (20:45 +0100)]
ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
Add the capture pcm only if there is at least one ADC configured in
the SYSCONF register.
Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Adrian Knoth [Tue, 10 Jan 2012 19:58:40 +0000 (20:58 +0100)]
ALSA: hdspm - Provide unique driver id based on card serial
Before, /proc/asound looked like this:
2 [Default ]: HDSPM - RME RayDAT_f1cd85
RME RayDAT S/N 0xf1cd85 at 0xf7300000, irq 18
In case of a second HDSPM card, its name would be Default_1. This is
cumbersome, because the order of the cards isn't stable across reboots.
To help userspace tools referring to the correct card, this commit
provides a unique id for each card:
2 [HDSPMxf1cd85 ]: HDSPM - RME RayDAT_f1cd85
RME RayDAT S/N 0xf1cd85 at 0xf7300000, irq 18
In this example, userspace (configuration files) would then use
hw:HDSPMxf1cd85 to choose the right card.
The serial is masked to 24bits, so this string is always shorter than
sixteen chars.
Signed-off-by: Adrian Knoth <adi@drcomp.erfurt.thur.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Linus Torvalds [Wed, 11 Jan 2012 06:01:27 +0000 (22:01 -0800)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (54 commits)
crypto: gf128mul - remove leftover "(EXPERIMENTAL)" in Kconfig
crypto: serpent-sse2 - remove unneeded LRW/XTS #ifdefs
crypto: serpent-sse2 - select LRW and XTS
crypto: twofish-x86_64-3way - remove unneeded LRW/XTS #ifdefs
crypto: twofish-x86_64-3way - select LRW and XTS
crypto: xts - remove dependency on EXPERIMENTAL
crypto: lrw - remove dependency on EXPERIMENTAL
crypto: picoxcell - fix boolean and / or confusion
crypto: caam - remove DECO access initialization code
crypto: caam - fix polarity of "propagate error" logic
crypto: caam - more desc.h cleanups
crypto: caam - desc.h - convert spaces to tabs
crypto: talitos - convert talitos_error to struct device
crypto: talitos - remove NO_IRQ references
crypto: talitos - fix bad kfree
crypto: convert drivers/crypto/* to use module_platform_driver()
char: hw_random: convert drivers/char/hw_random/* to use module_platform_driver()
crypto: serpent-sse2 - should select CRYPTO_CRYPTD
crypto: serpent - rename serpent.c to serpent_generic.c
crypto: serpent - cleanup checkpatch errors and warnings
...
Linus Torvalds [Wed, 11 Jan 2012 05:51:23 +0000 (21:51 -0800)]
Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-security
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security: (32 commits)
ima: fix invalid memory reference
ima: free duplicate measurement memory
security: update security_file_mmap() docs
selinux: Casting (void *) value returned by kmalloc is useless
apparmor: fix module parameter handling
Security: tomoyo: add .gitignore file
tomoyo: add missing rcu_dereference()
apparmor: add missing rcu_dereference()
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mpi/mpi-mpow: NULL dereference on allocation failure
digsig: build dependency fix
KEYS: Give key types their own lockdep class for key->sem
TPM: fix transmit_cmd error logic
TPM: NSC and TIS drivers X86 dependency fix
TPM: Export wait_for_stat for other vendor specific drivers
TPM: Use vendor specific function for status probe
tpm_tis: add delay after aborting command
tpm_tis: Check return code from getting timeouts/durations
tpm: Introduce function to poll for result of self test
...
Fix up trivial conflict in lib/Makefile due to addition of CONFIG_MPI
and SIGSIG next to CONFIG_DQL addition.
Linus Torvalds [Wed, 11 Jan 2012 05:46:36 +0000 (21:46 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs4: deal with autofs4_write/autofs4_write races
autofs4: catatonic_mode vs. notify_daemon race
autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
hfsplus: creation of hidden dir on mount can fail
block_dev: Suppress bdev_cache_init() kmemleak warninig
fix shrink_dcache_parent() livelock
coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
coda: deal correctly with allocation failure from coda_cnode_makectl()
securityfs: fix object creation races
Al Viro [Wed, 11 Jan 2012 03:35:38 +0000 (22:35 -0500)]
autofs4: deal with autofs4_write/autofs4_write races
Just serialize the actual writing of packets into pipe on
a new mutex, independent from everything else in the locking
hierarchy. As soon as something has started feeding a piece
of packet into the pipe to daemon, we *want* everything else
about to try the same to wait until we are done.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 11 Jan 2012 03:24:48 +0000 (22:24 -0500)]
autofs4: catatonic_mode vs. notify_daemon race
we need to hold ->wq_mutex while we are forming the packet to send,
lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
just as autofs4_notify_daemon() decides to memcpy() from it...
We do have check for catatonic mode immediately after that (under
->wq_mutex, as it ought to be) and packet won't be actually sent,
but it'll be too late for us if we oops on that memcpy() from NULL...
Fix is obvious - just extend the area covered by ->wq_mutex over
that switch and check whether it's catatonic *before* doing anything
else.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 11 Jan 2012 03:20:12 +0000 (22:20 -0500)]
autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
for good, or we might end up with wq inserted into queue after
autofs4_catatonic_mode() had done its thing. It will stick there
forever, since there won't be anything to clear its ->name.name.
A bit of a complication: validate_request() drops and regains ->wq_mutex.
It actually ends up the most convenient place to stick the check into...
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Wed, 11 Jan 2012 02:04:27 +0000 (18:04 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/mst/vhost
lib: use generic pci_iomap on all architectures
Many architectures don't want to pull in iomap.c,
so they ended up duplicating pci_iomap from that file.
That function isn't trivial, and we are going to modify it
https://lkml.org/lkml/2011/11/14/183
so the duplication hurts.
This reduces the scope of the problem significantly,
by moving pci_iomap to a separate file and
referencing that from all architectures.
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
alpha: drop pci_iomap/pci_iounmap from pci-noop.c
mn10300: switch to GENERIC_PCI_IOMAP
mn10300: add missing __iomap markers
frv: switch to GENERIC_PCI_IOMAP
tile: switch to GENERIC_PCI_IOMAP
tile: don't panic on iomap
sparc: switch to GENERIC_PCI_IOMAP
sh: switch to GENERIC_PCI_IOMAP
powerpc: switch to GENERIC_PCI_IOMAP
parisc: switch to GENERIC_PCI_IOMAP
mips: switch to GENERIC_PCI_IOMAP
microblaze: switch to GENERIC_PCI_IOMAP
arm: switch to GENERIC_PCI_IOMAP
alpha: switch to GENERIC_PCI_IOMAP
lib: add GENERIC_PCI_IOMAP
lib: move GENERIC_IOMAP to lib/Kconfig
Fix up trivial conflicts due to changes nearby in arch/{m68k,score}/Kconfig
Linus Torvalds [Wed, 11 Jan 2012 01:39:40 +0000 (17:39 -0800)]
Merge tag 'for-linux-3.3-merge-window' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming
* tag 'for-linux-3.3-merge-window' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming: (29 commits)
C6X: replace tick_nohz_stop/restart_sched_tick calls
C6X: add register_cpu call
C6X: deal with memblock API changes
C6X: fix timer64 initialization
C6X: fix layout of EMIFA registers
C6X: MAINTAINERS
C6X: DSCR - Device State Configuration Registers
C6X: EMIF - External Memory Interface
C6X: general SoC support
C6X: library code
C6X: headers
C6X: ptrace support
C6X: loadable module support
C6X: cache control
C6X: clocks
C6X: build infrastructure
C6X: syscalls
C6X: interrupt handling
C6X: time management
C6X: signal management
...
Linus Torvalds [Wed, 11 Jan 2012 01:37:49 +0000 (17:37 -0800)]
Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblaze
* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: Wire-up new system calls
microblaze: Remove NO_IRQ from architecture
input: xilinx_ps2: Don't use NO_IRQ
block: xsysace: Don't use NO_IRQ
microblaze: Trivial asm fix
microblaze: Fix debug message in module
microblaze: Remove eprintk macro
microblaze: Send CR before LF for early console
microblaze: Change NO_IRQ to 0
microblaze: Use irq_of_parse_and_map for timer
microblaze: intc: Change variable name
microblaze: Use of_find_compatible_node for timer and intc
microblaze: Add __cmpdi2
microblaze: Synchronize __pa __va macros
Linus Torvalds [Wed, 11 Jan 2012 01:37:20 +0000 (17:37 -0800)]
Merge branch 'unicore32' of git://github.com/gxt/linux
* 'unicore32' of git://github.com/gxt/linux:
rtc-puv3: solve section mismatch in rtc-puv3.c
rtc-puv3: using module_platform_driver()
i2c-puv3: using module_platform_driver()
rtc-puv3: irq: remove IRQF_DISABLED
unicore32: Remove IRQF_DISABLED
unicore32: Use set_current_blocked()
unicore32: add ioremap_nocache definition
unicore32: delete specified xlate_dev_mem_ptr
of: add include asm/setup.h in drivers/of/fdt.c
unicore32: standardize /proc/iomem "Kernel code" name
Linus Torvalds [Wed, 11 Jan 2012 01:36:43 +0000 (17:36 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/lliubbo/blackfin
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin:
blackfin: bf561: add adv7183 capture support
blackfin: bf537: add capture support
blackfin: bf548: add capture support
blackfin: time-ts: rm unused func broadcast_timer_setup()
blackfin: i2c-lcd: change default clock rate
blackfin: mac: dsa: add vlan mask in board file
blackfin: bf537: change num_chipselect for spi-sport
blackfin: serial: bfin-uart: remove unused field
bf54x: get mem size: missing break in switch
blackfin: smp: fix msg queue overflow issue
blackfin: config: update macro SPI_BFIN in board file
blackfin: config: update def config for all boards
blackfin: smp: cleanup smp code
blackfin: smp: add suspend and wakeup irq flags
blackfin: bf533-stamp: add missed patches for new asoc driver
blackfin: bf533-stamp: fix ad1836 name
Linus Torvalds [Wed, 11 Jan 2012 00:59:59 +0000 (16:59 -0800)]
Merge branch 'writeback-for-linus' of git://git./linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
writeback: balanced_rate cannot exceed write bandwidth
writeback: do strict bdi dirty_exceeded
writeback: avoid tiny dirty poll intervals
writeback: max, min and target dirty pause time
writeback: dirty ratelimit - think time compensation
btrfs: fix dirtied pages accounting on sub-page writes
writeback: fix dirtied pages accounting on redirty
writeback: fix dirtied pages accounting on sub-page writes
writeback: charge leaked page dirties to active tasks
writeback: Include all dirty inodes in background writeback
Linus Torvalds [Wed, 11 Jan 2012 00:42:48 +0000 (16:42 -0800)]
Merge branch 'akpm' (aka "Andrew's patch-bomb")
Andrew elucidates:
- First installmeant of MM. We have a HUGE number of MM patches this
time. It's crazy.
- MAINTAINERS updates
- backlight updates
- leds
- checkpatch updates
- misc ELF stuff
- rtc updates
- reiserfs
- procfs
- some misc other bits
* akpm: (124 commits)
user namespace: make signal.c respect user namespaces
workqueue: make alloc_workqueue() take printf fmt and args for name
procfs: add hidepid= and gid= mount options
procfs: parse mount options
procfs: introduce the /proc/<pid>/map_files/ directory
procfs: make proc_get_link to use dentry instead of inode
signal: add block_sigmask() for adding sigmask to current->blocked
sparc: make SA_NOMASK a synonym of SA_NODEFER
reiserfs: don't lock root inode searching
reiserfs: don't lock journal_init()
reiserfs: delay reiserfs lock until journal initialization
reiserfs: delete comments referring to the BKL
drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
drivers/rtc/: remove redundant spi driver bus initialization
drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
rtc: convert drivers/rtc/* to use module_platform_driver()
drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
...
Serge E. Hallyn [Tue, 10 Jan 2012 23:11:37 +0000 (15:11 -0800)]
user namespace: make signal.c respect user namespaces
ipc/mqueue.c: for __SI_MESQ, convert the uid being sent to recipient's
user namespace. (new, thanks Oleg)
__send_signal: convert current's uid to the recipient's user namespace
for any siginfo which is not SI_FROMKERNEL (patch from Oleg, thanks
again :)
do_notify_parent and do_notify_parent_cldstop: map task's uid to parent's
user namespace
ptrace_signal maps parent's uid into current's user namespace before
including in signal to current. IIUC Oleg has argued that this shouldn't
matter as the debugger will play with it, but it seems like not converting
the value currently being set is misleading.
Changelog:
Sep 20: Inspired by Oleg's suggestion, define map_cred_ns() helper to
simplify callers and help make clear what we are translating
(which uid into which namespace). Passing the target task would
make callers even easier to read, but we pass in user_ns because
current_user_ns() != task_cred_xxx(current, user_ns).
Sep 20: As recommended by Oleg, also put task_pid_vnr() under rcu_read_lock
in ptrace_signal().
Sep 23: In send_signal(), detect when (user) signal is coming from an
ancestor or unrelated user namespace. Pass that on to __send_signal,
which sets si_uid to 0 or overflowuid if needed.
Oct 12: Base on Oleg's fixup_uid() patch. On top of that, handle all
SI_FROMKERNEL cases at callers, because we can't assume sender is
current in those cases.
Nov 10: (mhelsley) rename fixup_uid to more meaningful usern_fixup_signal_uid
Nov 10: (akpm) make the !CONFIG_USER_NS case clearer
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Serge Hallyn <serge.hallyn@canonical.com>
Subject: __send_signal: pass q->info, not info, to userns_fixup_signal_uid (v2)
Eric Biederman pointed out that passing info is a bug and could lead to a
NULL pointer deref to boot.
A collection of signal, securebits, filecaps, cap_bounds, and a few other
ltp tests passed with this kernel.
Changelog:
Nov 18: previous patch missed a leading '&'
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: ipc/mqueue: lock() => unlock() typo
There was a double lock typo introduced in
b085f4bd6b21 "user namespace:
make signal.c respect user namespaces"
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Tue, 10 Jan 2012 23:11:35 +0000 (15:11 -0800)]
workqueue: make alloc_workqueue() take printf fmt and args for name
alloc_workqueue() currently expects the passed in @name pointer to remain
accessible. This is inconvenient and a bit silly given that the whole wq
is being dynamically allocated. This patch updates alloc_workqueue() and
friends to take printf format string instead of opaque string and matching
varargs at the end. The name is allocated together with the wq and
formatted.
alloc_ordered_workqueue() is converted to a macro to unify varargs
handling with alloc_workqueue(), and, while at it, add comment to
alloc_workqueue().
None of the current in-kernel users pass in string with '%' as constant
name and this change shouldn't cause any problem.
[akpm@linux-foundation.org: use __printf]
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasiliy Kulikov [Tue, 10 Jan 2012 23:11:31 +0000 (15:11 -0800)]
procfs: add hidepid= and gid= mount options
Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.
The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:
hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.
hidepid=1 means users may not access any /proc/<pid>/ directories, but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.
hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It compicates intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.
gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting
nonroot user in sudoers file or something. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:
http://www.openwall.com/lists/oss-security/2011/11/05/3
hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user started the setuid program may read the
counters.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasiliy Kulikov [Tue, 10 Jan 2012 23:11:27 +0000 (15:11 -0800)]
procfs: parse mount options
Add support for procfs mount options. Actual mount options are coming in
the next patches.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg KH <greg@kroah.com>
Cc: Theodore Tso <tytso@MIT.EDU>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>