Oleg Nesterov [Wed, 23 Sep 2009 22:56:47 +0000 (15:56 -0700)]
do_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case
Suggested by Roland.
do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
it is ->real_parent and the child is not traced. IOW, caller == p->parent
otherwise we should not wake up.
Change child_wait_callback() to check this. Ratan reports the workload with
CPU load >99% caused by unnecessary wakeups, should be fixed by this patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 23 Sep 2009 22:56:46 +0000 (15:56 -0700)]
do_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup
Ratan Nalumasu reported that in a process with many threads doing
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.
Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.
We can make child_wait_callback() more clever later, right now it only
checks eligible_child().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 23 Sep 2009 22:56:45 +0000 (15:56 -0700)]
do_wait() wakeup optimization: shift security_task_wait() from eligible_child() to wait_consider_task()
Preparation, no functional changes.
eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(), this allows us to use it
for filtered wake_up().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 23 Sep 2009 22:56:44 +0000 (15:56 -0700)]
ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee
The bug is old, it wasn't cause by recent changes.
Test case:
static void *tfunc(void *arg)
{
int pid = (long)arg;
assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
kill(pid, SIGKILL);
sleep(1);
return NULL;
}
int main(void)
{
pthread_t th;
long pid = fork();
if (!pid)
pause();
signal(SIGCHLD, SIG_IGN);
assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);
int r = waitpid(-1, NULL, __WNOTHREAD);
printf("waitpid: %d %m\n", r);
return 0;
}
Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == -ECHILD.
The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daisuke Nishimura [Wed, 23 Sep 2009 22:56:43 +0000 (15:56 -0700)]
memcg: show swap usage in stat file
We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would
be useful for users to show swap usage in memory.stat file, because they
don't need calculate memsw.usage - res.usage to know swap usage.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:42 +0000 (15:56 -0700)]
memcg: improve resource counter scalability
Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup. This is a part of the several patches to reduce mem cgroup
overhead. I had posted other approaches earlier (including using percpu
counters). Those patches will be a natural addition and will be added
iteratively on top of these.
The patch stops resource counter accounting for the root cgroup. The data
for display is derived from the statisitcs we maintain via
mem_cgroup_charge_statistics (which is more scalable). What happens today
is that, we do double accounting, once using res_counter_charge() and once
using memory_cgroup_charge_statistics(). For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.
The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
additional data about swapped out memory. This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
recursively when hierarchy is enabled.
The tests results I see on a 24 way show that
1. The lock contention disappears from /proc/lock_stats
2. The results of the test are comparable to running with
cgroup_disable=memory.
Here is a sample of my program runs
Without Patch
Performance counter stats for '/home/balbir/parallel_pagefault':
7192804.124144 task-clock-msecs # 23.937 CPUs
424691 context-switches # 0.000 M/sec
267 CPU-migrations # 0.000 M/sec
28498113 page-faults # 0.004 M/sec
5826093739340 cycles # 809.989 M/sec
408883496292 instructions # 0.070 IPC
7057079452 cache-references # 0.981 M/sec
3036086243 cache-misses # 0.422 M/sec
300.
485365680 seconds time elapsed
With cgroup_disable=memory
Performance counter stats for '/home/balbir/parallel_pagefault':
7182183.546587 task-clock-msecs # 23.915 CPUs
425458 context-switches # 0.000 M/sec
203 CPU-migrations # 0.000 M/sec
92545093 page-faults # 0.013 M/sec
6034363609986 cycles # 840.185 M/sec
437204346785 instructions # 0.072 IPC
6636073192 cache-references # 0.924 M/sec
2358117732 cache-misses # 0.328 M/sec
300.
320905827 seconds time elapsed
With this patch applied
Performance counter stats for '/home/balbir/parallel_pagefault':
7191619.223977 task-clock-msecs # 23.955 CPUs
422579 context-switches # 0.000 M/sec
88 CPU-migrations # 0.000 M/sec
91946060 page-faults # 0.013 M/sec
5957054385619 cycles # 828.333 M/sec
1058117350365 instructions # 0.178 IPC
9161776218 cache-references # 1.274 M/sec
1920494280 cache-misses # 0.267 M/sec
300.
218764862 seconds time elapsed
Data from Prarit (kernel compile with make -j64 on a 64
CPU/32G machine)
For a single run
Without patch
real 27m8.988s
user 87m24.916s
sys 382m6.037s
With patch
real 4m18.607s
user 84m58.943s
sys 50m52.682s
With config turned off
real 4m54.972s
user 90m13.456s
sys 50m19.711s
NOTE: The data looks counterintuitive due to the increased performance
with the patch, even over the config being turned off. We probably need
more runs, but so far all testing has shown that the patches definitely
help.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:39 +0000 (15:56 -0700)]
memory controller: soft limit reclaim on contention
Implement reclaim from groups over their soft limit
Permit reclaim from memory cgroups on contention (via the direct reclaim
path).
memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.
Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off. The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targetted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.
[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:38 +0000 (15:56 -0700)]
memory controller: soft limit refactor reclaim flags
Refactor mem_cgroup_hierarchical_reclaim()
Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
flags, so that new parameters don't have to be passed as we make the
reclaim routine more flexible
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:37 +0000 (15:56 -0700)]
memory controller: soft limit organize cgroups
Organize cgroups over soft limit in a RB-Tree
Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to
1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates, updates take place only after a particular
time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:36 +0000 (15:56 -0700)]
memory controller: soft limit interface
Add an interface to allow get/set of soft limits. Soft limits for memory
plus swap controller (memsw) is currently not supported. Resource
counters have been enhanced to support soft limits and new type
RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.
Kamezawa-San raised a question as to whether soft limit should belong to
res_counter. Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here. Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:34 +0000 (15:56 -0700)]
memory controller: soft limit documentation
Soft limits is a new feature for the memory resource controller, something
similar has existed in the group scheduler in the form of shares. The CPU
controllers interpretation of shares is very different though.
Soft limits are the most useful feature to have for environments where the
administrator wants to overcommit the system, such that only on memory
contention do the limits become active. The current soft limits
implementation provides a soft_limit_in_bytes interface for the memory
controller and not for memory+swap controller. The implementation
maintains an RB-Tree of groups that exceed their soft limit and starts
reclaiming from the group that exceeds this limit by the maximum amount.
This patch:
Add documentation for soft limits
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki [Wed, 23 Sep 2009 22:56:33 +0000 (15:56 -0700)]
memcg: add comments explaining memory barriers
Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
[akpm@linux-foundation.org: coding-style fixes]
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Balbir Singh [Wed, 23 Sep 2009 22:56:32 +0000 (15:56 -0700)]
memcg: remove the overhead associated with the root cgroup
Change the memory cgroup to remove the overhead associated with accounting
all pages in the root cgroup. As a side-effect, we can no longer set a
memory hard limit in the root cgroup.
A new flag to track whether the page has been accounted or not has been
added as well. Flags are now set atomically for page_cgroup,
pcg_default_flags is now obsolete and removed.
[akpm@linux-foundation.org: fix a few documentation glitches]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Blum [Wed, 23 Sep 2009 22:56:31 +0000 (15:56 -0700)]
cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time
Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)
Currently, new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.
[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Blum [Wed, 23 Sep 2009 22:56:29 +0000 (15:56 -0700)]
cgroups: change css_set freeing mechanism to be under RCU
Changes css_set freeing mechanism to be under RCU
This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Blum [Wed, 23 Sep 2009 22:56:28 +0000 (15:56 -0700)]
cgroups: use vmalloc for large cgroups pidlist allocations
Separates all pidlist allocation requests to a separate function that
judges based on the requested size whether or not the array needs to be
vmalloced or can be gotten via kmalloc, and similar for kfree/vfree.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Blum [Wed, 23 Sep 2009 22:56:27 +0000 (15:56 -0700)]
cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces
Previously there was the problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace. Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and that the pidlists themselves are passed
around in the seq_file's private pointer means we don't have to touch the
cgroup or its master list except when creating and destroying entries.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Blum [Wed, 23 Sep 2009 22:56:26 +0000 (15:56 -0700)]
cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids
struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file. Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.
Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Wed, 23 Sep 2009 22:56:25 +0000 (15:56 -0700)]
cgroups: revert "cgroups: fix pid namespace bug"
The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.
The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.
Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count. This scheme should introduce no extra
overhead in the fork path when there's no contention.
The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.
This patch:
Revert
commit
096b7fe012d66ed55e98bc8022405ede0cc80e96
Author: Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700
cgroups: fix pid namespace bug
This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.
The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Wed, 23 Sep 2009 22:56:23 +0000 (15:56 -0700)]
cgroups: allow cgroup hierarchies to be created with no bound subsystems
This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem. The mount option "none" is treated as an
explicit request for no bound subsystems.
A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.
As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida. Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.
Example usage:
mount -t cgroup -o none,name=foo cgroup /mnt/cgroup
Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Wed, 23 Sep 2009 22:56:22 +0000 (15:56 -0700)]
cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup
Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.
This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Wed, 23 Sep 2009 22:56:20 +0000 (15:56 -0700)]
cgroups: move the cgroup debug subsys into cgroup.c to access internal state
While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures. Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Wed, 23 Sep 2009 22:56:19 +0000 (15:56 -0700)]
cgroups: support named cgroups hierarchies
To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option
A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.
Example usage:
# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1
# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaotian Feng [Wed, 23 Sep 2009 22:56:18 +0000 (15:56 -0700)]
cgroups: make unlock sequence in cgroup_get_sb consistent
Make the last unlock sequence consistent with previous unlock sequeue.
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Wed, 23 Sep 2009 22:56:17 +0000 (15:56 -0700)]
docs: fix various Documentation/ paths in header files
Fix various Documentation/ paths in include/linux/.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wu Fengguang [Wed, 23 Sep 2009 22:56:16 +0000 (15:56 -0700)]
page-types: add feature for walking process address space
Introduce "-p|--pid <pid>" for walking the process address space. The
default action is to walk raw memory PFNs.
Both the virtual address and physical address of each present pages will
be listed:
# ./tools/vm/page-types -lp $$ | head -3
voffset offset len flags
400 11bebe 1 __RU_lA____M______________________
402 11bebc 1 __RU_lA____M______________________
Note that voffset/offset/len are now showed as hex numbers.
[akpm@linux-foundation.org: coding-style fixes]
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Triplett [Wed, 23 Sep 2009 22:56:15 +0000 (15:56 -0700)]
Documentation/vm/.gitignore: add page-types
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jaswinder Singh Rajput [Wed, 23 Sep 2009 22:56:14 +0000 (15:56 -0700)]
includecheck fix: Documentation, cfag12864b-example.c
fix the following 'make includecheck' warning:
Documentation/auxdisplay/cfag12864b-example.c: string.h is included more than once.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaotian Feng [Wed, 23 Sep 2009 22:56:13 +0000 (15:56 -0700)]
Documentation: update stale definition of file-nr in fs.txt
In "documentation: update Documentation/filesystem/proc.txt and
Documentation/sysctls" (commit
760df93ec) we merged /proc/sys/fs
documentation in Documentation/sysctl/fs.txt and
Documentation/filesystem/proc.txt, but stale file-nr definition
remained.
This patch adds back the right fs-nr definition for 2.6 kernel.
Signed-off-by: Xiaotian Feng<dfeng@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peng Tao [Wed, 23 Sep 2009 22:56:13 +0000 (15:56 -0700)]
doc/filesystems: more mount cleanups
Documentation/filesystems/sharedsubtree.txt needs updating because the
mount command in util-linux package is well aware of shared subtree
features now. The patch also fixes two typos in sharedsubtree.txt.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Wed, 23 Sep 2009 22:56:11 +0000 (15:56 -0700)]
doc/filesystems: remove smount program
mount(8) handles shared subtrees just fine, so remove the smount program
from Documentation/filesystems/sharedsubtree.txt.
Fix annoying "Lets" -> "Let's".
Insert space between '#' prompt and "mount" command.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhaolei [Wed, 23 Sep 2009 22:56:10 +0000 (15:56 -0700)]
time: add function to convert between calendar time and broken-down time for universal use
There are many similar code in kernel for one object: convert time between
calendar time and broken-down time.
Here is some source I found:
fs/ncpfs/dir.c
fs/smbfs/proc.c
fs/fat/misc.c
fs/udf/udftime.c
fs/cifs/netmisc.c
net/netfilter/xt_time.c
drivers/scsi/ips.c
drivers/input/misc/hp_sdc_rtc.c
drivers/rtc/rtc-lib.c
arch/ia64/hp/sim/boot/fw-emu.c
arch/m68k/mac/misc.c
arch/powerpc/kernel/time.c
arch/parisc/include/asm/rtc.h
...
We can make a common function for this type of conversion, At least we
can get following benefit:
1: Make kernel simple and unify
2: Easy to fix bug in converting code
3: Reduce clone of code in future
For example, I'm trying to make ftrace display walltime,
this patch will make me easy.
This code is based on code from glibc-2.6
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From: Mel Gorman [Wed, 23 Sep 2009 22:56:05 +0000 (15:56 -0700)]
hugetlbfs: do not call user_shm_lock() for MAP_HUGETLB fix
Commit
6bfde05bf5c ("hugetlbfs: allow the creation of files suitable for
MAP_PRIVATE on the vfs internal mount") altered can_do_hugetlb_shm() to
check if a file is being created for shared memory or mmap(). If this
returns false, we then unconditionally call user_shm_lock() triggering a
warning. This block should never be entered for MAP_HUGETLB. This
patch partially reverts the problem and fixes the check.
Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Izik Eidus [Wed, 23 Sep 2009 22:56:04 +0000 (15:56 -0700)]
ksm: change default values to better fit into mainline kernel
Now that ksm is in mainline it is better to change the default values to
better fit to most of the users.
This patch change the ksm default values to be:
ksm_thread_pages_to_scan = 100 (instead of 200)
ksm_thread_sleep_millisecs = 20 (like before)
ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
disabled by default)
ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)
The important aspect of this patch is: it disables ksm by default, and sets
the number of the kernel_pages that can be allocated to be a reasonable
number.
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo Molnar [Wed, 23 Sep 2009 22:56:02 +0000 (15:56 -0700)]
input: fix build failures caused by Kconfig Winbond WPCD376I Consumer IR hardware driver Kconfig entry
Fix these warnings:
drivers/built-in.o: In function `apanel_remove':
apanel.c:(.text+0x56e852): undefined reference to `led_classdev_unregister'
drivers/built-in.o: In function `apanel_probe':
apanel.c:(.text+0x56eae3): undefined reference to `led_classdev_register'
drivers/built-in.o: In function `acpi_fujitsu_hotkey_add':
fujitsu-laptop.c:(.text+0x5d7647): undefined reference to `led_classdev_register'
fujitsu-laptop.c:(.text+0x5d76b5): undefined reference to `led_classdev_register'
drivers/built-in.o: In function `wbcir_probe':
winbond-cir.c:(.devinit.text+0x5f375): undefined reference to `led_classdev_register'
winbond-cir.c:(.devinit.text+0x5f663): undefined reference to `led_classdev_unregister'
drivers/built-in.o: In function `wbcir_remove':
winbond-cir.c:(.devexit.text+0x7f23): undefined reference to `led_classdev_unregister'
drivers/built-in.o: In function `fujitsu_cleanup':
fujitsu-laptop.c:(.exit.text+0xbe37): undefined reference to `led_classdev_unregister'
fujitsu-laptop.c:(.exit.text+0xbe53): undefined reference to `led_classdev_unregister'
It happens because the new INPUT_WINBOND_CIR driver relies on new-leds
infrastructure - but does not select it in drivers/input/misc/Kconfig.
But it selects LEDS_CLASS, which confuses a number of other drivers into
thinking that all the leds infrastructure is in place.
Fix this by selecting NEW_LEDS as well, like similar drivers do.
Eventually, this whole leds infrastructure complexity should be
cleaned up, it's been going on for years.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: David Härdeman <david@hardeman.nu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 24 Sep 2009 01:14:11 +0000 (18:14 -0700)]
Merge git://git./linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (39 commits)
cpumask: Move deprecated functions to end of header.
cpumask: remove unused deprecated functions, avoid accusations of insanity
cpumask: use new-style cpumask ops in mm/quicklist.
cpumask: use mm_cpumask() wrapper: x86
cpumask: use mm_cpumask() wrapper: um
cpumask: use mm_cpumask() wrapper: mips
cpumask: use mm_cpumask() wrapper: mn10300
cpumask: use mm_cpumask() wrapper: m32r
cpumask: use mm_cpumask() wrapper: arm
cpumask: Use accessors for cpu_*_mask: um
cpumask: Use accessors for cpu_*_mask: powerpc
cpumask: Use accessors for cpu_*_mask: mips
cpumask: Use accessors for cpu_*_mask: m32r
cpumask: remove arch_send_call_function_ipi
cpumask: arch_send_call_function_ipi_mask: s390
cpumask: arch_send_call_function_ipi_mask: powerpc
cpumask: arch_send_call_function_ipi_mask: mips
cpumask: arch_send_call_function_ipi_mask: m32r
cpumask: arch_send_call_function_ipi_mask: alpha
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: ia64
...
Alexey Dobriyan [Thu, 24 Sep 2009 00:22:25 +0000 (04:22 +0400)]
headers: utsname.h redux
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sebastian Andrzej Siewior [Wed, 23 Sep 2009 23:02:55 +0000 (01:02 +0200)]
Revert "kmod: fix race in usermodehelper code"
This reverts commit
c02e3f361c7 ("kmod: fix race in usermodehelper code")
The patch is wrong. UMH_WAIT_EXEC is called with VFORK what ensures
that the child finishes prior returing back to the parent. No race.
In fact, the patch makes it even worse because it does the thing it
claims not do:
- It calls ->complete() on UMH_WAIT_EXEC
- the complete() callback may de-allocated subinfo as seen in the
following call chain:
[<
c009f904>] (__link_path_walk+0x20/0xeb4) from [<
c00a094c>] (path_walk+0x48/0x94)
[<
c00a094c>] (path_walk+0x48/0x94) from [<
c00a0a34>] (do_path_lookup+0x24/0x4c)
[<
c00a0a34>] (do_path_lookup+0x24/0x4c) from [<
c00a158c>] (do_filp_open+0xa4/0x83c)
[<
c00a158c>] (do_filp_open+0xa4/0x83c) from [<
c009ba90>] (open_exec+0x24/0xe0)
[<
c009ba90>] (open_exec+0x24/0xe0) from [<
c009bfa8>] (do_execve+0x7c/0x2e4)
[<
c009bfa8>] (do_execve+0x7c/0x2e4) from [<
c0026a80>] (kernel_execve+0x34/0x80)
[<
c0026a80>] (kernel_execve+0x34/0x80) from [<
c004b514>] (____call_usermodehelper+0x130/0x148)
[<
c004b514>] (____call_usermodehelper+0x130/0x148) from [<
c0024858>] (kernel_thread_exit+0x0/0x8)
and the path pointer was NULL. Good that ARM's kernel_execve()
doesn't check the pointer for NULL or else I wouldn't notice it.
The only race there might be is with UMH_NO_WAIT but it is too late for
me to investigate it now. UMH_WAIT_PROC could probably also use VFORK
and we could save one exec. So the only race I see is with UMH_NO_WAIT
and recent scheduler changes where the child does not always run first
might have trigger here something but as I said, it is late....
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rusty Russell [Thu, 24 Sep 2009 15:34:53 +0000 (09:34 -0600)]
cpumask: Move deprecated functions to end of header.
The new ones have pretty kerneldoc. Move the old ones to the end to
avoid confusing people.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: benh@kernel.crashing.org
Rusty Russell [Thu, 24 Sep 2009 15:34:52 +0000 (09:34 -0600)]
cpumask: remove unused deprecated functions, avoid accusations of insanity
We're not forcing removal of the old cpu_ functions, but we might as
well delete the now-unused ones.
Especially CPUMASK_ALLOC and friends. I actually got a phone call (!)
from a hacker who thought I had introduced them as the new cpumask
API. He seemed bewildered that I had lost all taste.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: benh@kernel.crashing.org
Rusty Russell [Thu, 24 Sep 2009 15:34:52 +0000 (09:34 -0600)]
cpumask: use new-style cpumask ops in mm/quicklist.
This slipped past the previous sweeps.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Rusty Russell [Thu, 24 Sep 2009 15:34:51 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: x86
Makes code futureproof against the impending change to mm->cpu_vm_mask (to be a pointer).
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:51 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: um
Makes code futureproof against the impending change to mm->cpu_vm_mask.
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:50 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: mips
Makes code futureproof against the impending change to mm->cpu_vm_mask.
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:50 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: mn10300
Makes code futureproof against the impending change to mm->cpu_vm_mask
(to be a pointer).
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Also change the actual arg name here to "mm" (which it is), not "task".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:49 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: m32r
Makes code futureproof against the impending change to mm->cpu_vm_mask.
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Hirokazu Takata <takata@linux-m32r.org> (fixes)
Rusty Russell [Thu, 24 Sep 2009 15:34:49 +0000 (09:34 -0600)]
cpumask: use mm_cpumask() wrapper: arm
Makes code futureproof against the impending change to mm->cpu_vm_mask.
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:48 +0000 (09:34 -0600)]
cpumask: Use accessors for cpu_*_mask: um
Use the accessors rather than frobbing bits directly (the new versions
are const).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:48 +0000 (09:34 -0600)]
cpumask: Use accessors for cpu_*_mask: powerpc
Use the accessors rather than frobbing bits directly (the new versions
are const).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:47 +0000 (09:34 -0600)]
cpumask: Use accessors for cpu_*_mask: mips
Use the accessors rather than frobbing bits directly (the new versions
are const).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:47 +0000 (09:34 -0600)]
cpumask: Use accessors for cpu_*_mask: m32r
Use the accessors rather than frobbing bits directly (the new versions
are const).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:46 +0000 (09:34 -0600)]
cpumask: remove arch_send_call_function_ipi
Now everyone is converted to arch_send_call_function_ipi_mask, remove
the shim and the #defines.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:45 +0000 (09:34 -0600)]
cpumask: arch_send_call_function_ipi_mask: s390
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:45 +0000 (09:34 -0600)]
cpumask: arch_send_call_function_ipi_mask: powerpc
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask(), and by defining
it, the old arch_send_call_function_ipi is defined by the core code.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:44 +0000 (09:34 -0600)]
cpumask: arch_send_call_function_ipi_mask: mips
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask(), and by defining
it, the old arch_send_call_function_ipi is defined by the core code.
We also take the chance to wean the implementations off the
obsolescent for_each_cpu_mask(): making send_ipi_mask take the pointer
seemed the most natural way to ensure all implementations used
for_each_cpu.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:43 +0000 (09:34 -0600)]
cpumask: arch_send_call_function_ipi_mask: m32r
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask(), and by defining
it, the old arch_send_call_function_ipi is defined by the core code.
We also take the chance to wean the implementations off the
obsolescent for_each_cpu_mask(): making send_ipi_mask take the pointer
seemed the most natural way to ensure all implementations used
for_each_cpu.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:43 +0000 (09:34 -0600)]
cpumask: arch_send_call_function_ipi_mask: alpha
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask().
We also take the chance to wean the send_ipi_message off the
obsolescent for_each_cpu_mask(): making it take a pointer seemed the
most natural way to do this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:42 +0000 (09:34 -0600)]
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: ia64
There were replaced by topology_core_cpumask and topology_thread_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:42 +0000 (09:34 -0600)]
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: powerpc
There were replaced by topology_core_cpumask and topology_thread_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:41 +0000 (09:34 -0600)]
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: s390
There were replaced by topology_core_cpumask and topology_thread_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:41 +0000 (09:34 -0600)]
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: sparc
There were replaced by topology_core_cpumask and topology_thread_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:40 +0000 (09:34 -0600)]
cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: core
There were replaced by topology_core_cpumask and topology_thread_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:40 +0000 (09:34 -0600)]
cpumask: remove the deprecated smp_call_function_mask()
Everyone is now using smp_call_function_many().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:39 +0000 (09:34 -0600)]
ia64: convert last user of smp_call_function_mask
smp_call_function_many is the new version: it takes a pointer. Also,
use mm accessor macro while we're changing this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:38 +0000 (09:34 -0600)]
cpumask: don't define set_cpus_allowed() if CONFIG_CPUMASK_OFFSTACK=y
You're not supposed to pass cpumasks on the stack in that case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Bjorn Helgaas [Thu, 24 Sep 2009 15:34:38 +0000 (09:34 -0600)]
ACPI: remove cpumask_t usage
set_cpus_allowed() is on the way out; replace it with
set_cpus_allowed_ptr().
Reference: http://lkml.org/lkml/2008/11/6/448
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Nobuhiro Iwamatsu [Mon, 15 Jun 2009 03:16:54 +0000 (12:16 +0900)]
cpumask: Remove mask field from comments
By
7be23e278f, mask field was deleted by irqaction. However, it was not
deleted from comment.
Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:37 +0000 (09:34 -0600)]
cpumask: remove unused mask field from struct irqaction.
Up until 1.1.83, the primitive human tribes used struct sigaction for
interrupts. The sa_mask field was overloaded to hold a pointer to the
name.
When someone created the new "struct irqaction" they carried across
the "mask" field as a kind of ancestor worship: the fact that it was
unused makes clear its spiritual significance.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:36 +0000 (09:34 -0600)]
cpumask: remove last assignment to mask field of struct irqaction.
This snuck in after the patch which removed all the others.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Rusty Russell [Thu, 24 Sep 2009 15:34:36 +0000 (09:34 -0600)]
cpumask: remove unused cpu_mask_all
It's only defined for NR_CPUS > BITS_PER_LONG; cpu_all_mask is always
defined (and const).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:35 +0000 (09:34 -0600)]
cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL.: mips
(Thanks to Al Viro for reminding me of this, via Ingo)
CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:
#define CPU_MASK_ALL (cpumask_t) { { ... } }
Taking the address of such a temporary is questionable at best,
unfortunately
321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:
#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)
Which formalizes this practice. One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).
So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real struct cpumask *), and remove
CPU_MASK_ALL_PTR altogether.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:35 +0000 (09:34 -0600)]
cpumask: remove dangerous CPU_MASK_ALL_PTR
(Thanks to Al Viro for reminding me of this, via Ingo)
CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:
#define CPU_MASK_ALL (cpumask_t) { { ... } }
Taking the address of such a temporary is questionable at best,
unfortunately
321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:
#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)
Which formalizes this practice. One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).
Now all callers are removed, we kill it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
Rusty Russell [Thu, 24 Sep 2009 15:34:26 +0000 (09:34 -0600)]
cpumask: remove obsolete node_to_cpumask now everyone uses cpumask_of_node
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:25 +0000 (09:34 -0600)]
cpumask: remove the now-obsoleted pcibus_to_cpumask(): powerpc
cpumask_of_pcibus() is the new version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:25 +0000 (09:34 -0600)]
cpumask: remove the now-obsoleted pcibus_to_cpumask(): mips
cpumask_of_pcibus() is the new version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Thu, 24 Sep 2009 15:34:24 +0000 (09:34 -0600)]
cpumask: remove the now-obsoleted pcibus_to_cpumask(): alpha
cpumask_of_pcibus() is the new version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Li Zefan [Mon, 15 Jun 2009 06:58:26 +0000 (14:58 +0800)]
cpumask: use zalloc_cpumask_var() where possible
Remove open-coded zalloc_cpumask_var() and zalloc_cpumask_var_node().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Linus Torvalds [Wed, 23 Sep 2009 22:39:36 +0000 (15:39 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: add driver for Atmel AT42QT2160 Sensor Chip
Input: max7359 - use threaded IRQs
Input: add driver for Maxim MAX7359 key switch controller
Input: add driver for ADP5588 QWERTY I2C Keypad
Input: add touchscreen driver for MELFAS MCS-5000 controller
Input: add driver for OpenCores Keyboard Controller
Input: dm355evm_keys - remove dm355evm_keys_hardirq
Input: synaptics_i2c - switch to using __cancel_delayed_work()
Input: ad7879 - add support for AD7889
Input: atkbd - rely on input core to restore state on resume
Input: add generic suspend and resume for input devices
Input: libps2 - additional locking for i8042 ports
Linus Torvalds [Wed, 23 Sep 2009 22:37:02 +0000 (15:37 -0700)]
Merge git://git./linux/kernel/git/sam/kbuild-next
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (30 commits)
Use macros for .data.page_aligned section.
Use macros for .bss.page_aligned section.
Use new __init_task_data macro in arch init_task.c files.
kbuild: Don't define ALIGN and ENTRY when preprocessing linker scripts.
arm, cris, mips, sparc, powerpc, um, xtensa: fix build with bash 4.0
kbuild: add static to prototypes
kbuild: fail build if recordmcount.pl fails
kbuild: set -fconserve-stack option for gcc 4.5
kbuild: echo the record_mcount command
gconfig: disable "typeahead find" search in treeviews
kbuild: fix cc1 options check to ensure we do not use -fPIC when compiling
checkincludes.pl: add option to remove duplicates in place
markup_oops: use modinfo to avoid confusion with underscored module names
checkincludes.pl: provide usage helper
checkincludes.pl: close file as soon as we're done with it
ctags: usability fix
kernel hacking: move STRIP_ASM_SYMS from General
gitignore usr/initramfs_data.cpio.bz2 and usr/initramfs_data.cpio.lzma
kbuild: Check if linker supports the -X option
kbuild: introduce ld-option
...
Fix trivial conflict in scripts/basic/fixdep.c
Linus Torvalds [Wed, 23 Sep 2009 22:22:41 +0000 (15:22 -0700)]
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Propagate 'fsc' mount option through automounts
sunrpc/rpc_pipe: fix kernel-doc notation
sunrpc: xdr_xcode_hyper helpers cannot presume 64-bit alignment
NFS: Add nfs_alloc_parsed_mount_data
NFS/RPC: fix problems with reestablish_timeout and related code.
NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flags
Linus Torvalds [Wed, 23 Sep 2009 22:21:54 +0000 (15:21 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: Update documentation to add fscache related bits
9p: Add fscache support to 9p
9p: Fix the incorrect update of inode size in v9fs_file_write()
9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.
Linus Torvalds [Wed, 23 Sep 2009 22:20:16 +0000 (15:20 -0700)]
Merge branch 'hwmon-for-linus' of git://git./linux/kernel/git/jdelvare/staging
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
hwmon: (ltc4245) Clear faults at startup
hwmon: (ltc4215) Clear faults at startup
hwmon: (coretemp) Add Lynnfield CPU
hwmon: (coretemp) Add support for Penryn mobile CPUs
hwmon: (coretemp) Fix Atom CPUs support
hwmon: Delete deprecated FSC drivers
hwmon: (adm1031) Add sysfs files for temperature offsets
Linus Torvalds [Wed, 23 Sep 2009 22:18:57 +0000 (15:18 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
SELinux: do not destroy the avc_cache_nodep
KEYS: Have the garbage collector set its timer for live expired keys
tpm-fixup-pcrs-sysfs-file-update
creds_are_invalid() needs to be exported for use by modules:
include/linux/cred.h: fix build
Fix trivial BUILD_BUG_ON-induced conflicts in drivers/char/tpm/tpm.c
Ira W. Snyder [Wed, 23 Sep 2009 20:59:44 +0000 (22:59 +0200)]
hwmon: (ltc4245) Clear faults at startup
When power is applied to the ltc4245 chip it sometimes reports spurious
faults, which are exposed as alarms in the hwmon output. Clear the fault
register when the driver is installed to clear the alarms.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Ira W. Snyder [Wed, 23 Sep 2009 20:59:43 +0000 (22:59 +0200)]
hwmon: (ltc4215) Clear faults at startup
When power is applied to the ltc4215 chip it sometimes reports spurious
faults. The faults are not yet exposed via sysfs, however it may be useful
for userspace to read the fault register directly with the i2cget command.
Clear the fault register when the driver is installed so userspace doesn't
have to worry about spurious fault indications.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Huaxu Wan [Wed, 23 Sep 2009 20:59:43 +0000 (22:59 +0200)]
hwmon: (coretemp) Add Lynnfield CPU
Add Lynnfield processor support. Lynnfield is a quad-core Nehalem
based microprocessor for Desktop market, which is introduced in
September 2009.
Signed-off-by: Huaxu Wan <huaxu.wan@linux.intel.com>
Signed-off-by: Kent Liu <kent.liu@linux.intel.com>
Acked-by: Rudolf Marek <r.marek@assembler.cz>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Rudolf Marek [Wed, 23 Sep 2009 20:59:42 +0000 (22:59 +0200)]
hwmon: (coretemp) Add support for Penryn mobile CPUs
Following patch adds support for mobile Penryn CPUs. Intel documents this
poorly. I asked the Coretemp author for some help. This is totally untested and
may not work. Please test!
Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
Cc: Huaxu Wan <huaxu.wan@linux.intel.com>
Cc: Kent Liu <kent.liu@linux.intel.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Rudolf Marek [Wed, 23 Sep 2009 20:59:42 +0000 (22:59 +0200)]
hwmon: (coretemp) Fix Atom CPUs support
Fix Atom CPUs support. Intel documents TjMax at 90 degrees C but
some Atoms may have 125 degrees C (this is undocumented speculation).
Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
Cc: Huaxu Wan <huaxu.wan@linux.intel.com>
Cc: Kent Liu <kent.liu@linux.intel.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Jean Delvare [Wed, 23 Sep 2009 20:59:42 +0000 (22:59 +0200)]
hwmon: Delete deprecated FSC drivers
The legacy fscpos and fscher drivers have been replaced by the unified
fschmd driver. The transition period is over now, we can delete them.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Ira Snyder [Wed, 23 Sep 2009 20:59:41 +0000 (22:59 +0200)]
hwmon: (adm1031) Add sysfs files for temperature offsets
The ADM1030/ADM1031 chips have temperature offset registers, for both the
local and remote temperature sensors. Following the example set forth in
the LM90/ADM1032 driver, expose the offset registers to userspace.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
David Howells [Wed, 23 Sep 2009 18:36:39 +0000 (14:36 -0400)]
NFS: Propagate 'fsc' mount option through automounts
Propagate the NFS 'fsc' mount option through NFS automounts of various types.
This is now required as commit:
commit
c02d7adf8c5429727a98bad1d039bccad4c61c50
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Mon Jun 22 15:09:14 2009 -0400
NFSv4: Replace nfs4_path_walk() with VFS path lookup in a private namespace
uses VFS-driven automounting to reach all submounts barring the root, thus
preventing fscaching from being enabled on any submount other than the root.
This patch gets around that by propagating the NFS_OPTION_FSCACHE flag across
automounts. If a uniquifier is supplied to a mount then this is propagated to
all automounts of that mount too.
Signed-off-by: David Howells <dhowells@redhat.com>
[Trond: Fixed up the definition of nfs_fscache_get_super_cookie for the
case of #undef CONFIG_NFS_FSCACHE]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Randy Dunlap [Wed, 23 Sep 2009 18:36:38 +0000 (14:36 -0400)]
sunrpc/rpc_pipe: fix kernel-doc notation
Fix kernel-doc notation (& warnings) in sunrpc/rpc_pipe.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Benny Halevy [Wed, 23 Sep 2009 18:36:38 +0000 (14:36 -0400)]
sunrpc: xdr_xcode_hyper helpers cannot presume 64-bit alignment
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 23 Sep 2009 18:36:38 +0000 (14:36 -0400)]
NFS: Add nfs_alloc_parsed_mount_data
Allocating nfs_parsed_mount_data and setting up the defaults is nearly
the same for both nfs and nfs4 mounts.
Both paths seem to use nfs_validate_transport_protocol(), so setting a
default value for nfs_server.protocol ought to be unnecessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown [Wed, 23 Sep 2009 18:36:37 +0000 (14:36 -0400)]
NFS/RPC: fix problems with reestablish_timeout and related code.
[[resending with correct cc: - "vfs.kernel.org" just isn't right!]]
xprt->reestablish_timeout is used to cause TCP connection attempts to
back off if the connection fails so as not to hammer the network,
but to still allow immediate connections when there is no reason to
believe there is a problem.
It is not used for the first connection (when transport->sock is NULL)
but only on reconnects.
It is currently set:
a/ to 0 when xs_tcp_state_change finds a state of TCP_FIN_WAIT1
on the assumption that the client has closed the connection
so the reconnect should be immediate when needed.
b/ to at least XS_TCP_INIT_REEST_TO when xs_tcp_state_change
detects TCP_CLOSING or TCP_CLOSE_WAIT on the assumption that the
server closed the connection so a small delay at least is
required.
c/ as above when xs_tcp_state_change detects TCP_SYN_SENT, so that
it is never 0 while a connection has been attempted, else
the doubling will produce 0 and there will be no backoff.
d/ to double is value (up to a limit) when delaying a connection,
thus providing exponential backoff and
e/ to XS_TCP_INIT_REEST_TO in xs_setup_tcp as simple initialisation.
So you can see it is highly dependant on xs_tcp_state_change being
called as expected. However experimental evidence shows that
xs_tcp_state_change does not see all state changes.
("rpcdebug -m rpc trans" can help show what actually happens).
Results show:
TCP_ESTABLISHED is reported when a connection is made. TCP_SYN_SENT
is never reported, so rule 'c' above is never effective.
When the server closes the connection, TCP_CLOSE_WAIT and
TCP_LAST_ACK *might* be reported, and TCP_CLOSE is always
reported. This rule 'b' above will sometimes be effective, but
not reliably.
When the client closes the connection, it used to result in
TCP_FIN_WAIT1, TCP_FIN_WAIT2, TCP_CLOSE. However since commit
f75e674 (SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on
reconnect) we don't see *any* events on client-close. I think this
is because xs_restore_old_callbacks is called to disconnect
xs_tcp_state_change before the socket is closed.
In any case, rule 'a' no longer applies.
So all that is left are rule d, which successfully doubles the
timeout which is never rest, and rule e which initialises the timeout.
Even if the rules worked as expected, there would be a problem because
a successful connection does not reset the timeout, so a sequence
of events where the server closes the connection (e.g. during failover
testing) will cause longer and longer timeouts with no good reason.
This patch:
- sets reestablish_timeout to 0 in xs_close thus effecting rule 'a'
- sets it to 0 in xs_tcp_data_ready to ensure that a successful
connection resets the timeout
- sets it to at least XS_TCP_INIT_REEST_TO after it is doubled,
thus effecting rule c
I have not reimplemented rule b and the new version of rule c
seems sufficient.
I suspect other code in xs_tcp_data_ready needs to be revised as well.
For example I don't think connect_cookie is being incremented as often
as it should be.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Wed, 23 Sep 2009 18:36:37 +0000 (14:36 -0400)]
NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flags
Keep it in the case of the legacy binary mount interface, but purge it from
the nfs_server structure.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Linus Torvalds [Wed, 23 Sep 2009 18:25:16 +0000 (11:25 -0700)]
Merge branch 'ixp4xx' of git://git./linux/kernel/git/chris/linux-2.6
* 'ixp4xx' of git://git.kernel.org/pub/scm/linux/kernel/git/chris/linux-2.6:
Add MAINTAINERS entry for ARM/INTEL IXP4xx arch support.
ixp4xx: arch_idle() documentation fixup
ixp4xx: timer and clocks cleanups
Randy Dunlap [Mon, 21 Sep 2009 18:12:03 +0000 (11:12 -0700)]
serial core: fix new kernel-doc warnings
Fix new kernel-doc warnings in serial_core.[hc] files.
Warning(include/linux/serial_core.h:485): No description found for parameter 'uport'
Warning(include/linux/serial_core.h:485): Excess function parameter 'port' description in 'uart_handle_dcd_change'
Warning(include/linux/serial_core.h:511): No description found for parameter 'uport'
Warning(include/linux/serial_core.h:511): Excess function parameter 'port' description in 'uart_handle_cts_change'
Warning(drivers/serial/serial_core.c:2437): No description found for parameter 'uport'
Warning(drivers/serial/serial_core.c:2437): Excess function parameter 'port' description in 'uart_add_one_port'
Warning(drivers/serial/serial_core.c:2509): No description found for parameter 'uport'
Warning(drivers/serial/serial_core.c:2509): Excess function parameter 'port' description in 'uart_remove_one_port'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Paris [Mon, 21 Sep 2009 01:21:10 +0000 (21:21 -0400)]
SELinux: do not destroy the avc_cache_nodep
The security_ops reset done when SELinux is disabled at run time is done
after the avc cache is freed and after the kmem_cache for the avc is also
freed. This means that between the time the selinux disable code destroys
the avc_node_cachep another process could make a security request and could
try to allocate from the cache. We are just going to leave the cachep around,
like we always have.
SELinux: Disabled at runtime.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<
ffffffff81122537>] kmem_cache_alloc+0x9a/0x185
PGD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file:
CPU 1
Modules linked in:
Pid: 12, comm: khelper Not tainted 2.6.31-tip-05525-g0eeacc6-dirty #14819
System Product Name
RIP: 0010:[<
ffffffff81122537>] [<
ffffffff81122537>]
kmem_cache_alloc+0x9a/0x185
RSP: 0018:
ffff88003f9258b0 EFLAGS:
00010086
RAX:
0000000000000001 RBX:
0000000000000000 RCX:
0000000078c0129e
RDX:
0000000000000000 RSI:
ffffffff8130b626 RDI:
ffffffff81122528
RBP:
ffff88003f925900 R08:
0000000078c0129e R09:
0000000000000001
R10:
0000000000000000 R11:
0000000078c0129e R12:
0000000000000246
R13:
0000000000008020 R14:
ffff88003f8586d8 R15:
0000000000000001
FS:
0000000000000000(0000) GS:
ffff880002b00000(0000)
knlGS:
0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0:
000000008005003b
CR2:
0000000000000000 CR3:
0000000001001000 CR4:
00000000000006e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
ffffffff827bd420 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Process khelper (pid: 12, threadinfo
ffff88003f924000, task
ffff88003f928000)
Stack:
0000000000000246 0000802000000246 ffffffff8130b626 0000000000000001
<0>
0000000078c0129e 0000000000000000 ffff88003f925a70 0000000000000002
<0>
0000000000000001 0000000000000001 ffff88003f925960 ffffffff8130b626
Call Trace:
[<
ffffffff8130b626>] ? avc_alloc_node+0x36/0x273
[<
ffffffff8130b626>] avc_alloc_node+0x36/0x273
[<
ffffffff8130b545>] ? avc_latest_notif_update+0x7d/0x9e
[<
ffffffff8130b8b4>] avc_insert+0x51/0x18d
[<
ffffffff8130bcce>] avc_has_perm_noaudit+0x9d/0x128
[<
ffffffff8130bf20>] avc_has_perm+0x45/0x88
[<
ffffffff8130f99d>] current_has_perm+0x52/0x6d
[<
ffffffff8130fbb2>] selinux_task_create+0x2f/0x45
[<
ffffffff81303bf7>] security_task_create+0x29/0x3f
[<
ffffffff8105c6ba>] copy_process+0x82/0xdf0
[<
ffffffff81091578>] ? register_lock_class+0x2f/0x36c
[<
ffffffff81091a13>] ? mark_lock+0x2e/0x1e1
[<
ffffffff8105d596>] do_fork+0x16e/0x382
[<
ffffffff81091578>] ? register_lock_class+0x2f/0x36c
[<
ffffffff810d9166>] ? probe_workqueue_execution+0x57/0xf9
[<
ffffffff81091a13>] ? mark_lock+0x2e/0x1e1
[<
ffffffff810d9166>] ? probe_workqueue_execution+0x57/0xf9
[<
ffffffff8100cdb2>] kernel_thread+0x82/0xe0
[<
ffffffff81078b1f>] ? ____call_usermodehelper+0x0/0x139
[<
ffffffff8100ce10>] ? child_rip+0x0/0x20
[<
ffffffff81078aea>] ? __call_usermodehelper+0x65/0x9a
[<
ffffffff8107a5c7>] run_workqueue+0x171/0x27e
[<
ffffffff8107a573>] ? run_workqueue+0x11d/0x27e
[<
ffffffff81078a85>] ? __call_usermodehelper+0x0/0x9a
[<
ffffffff8107a7bc>] worker_thread+0xe8/0x10f
[<
ffffffff810808e2>] ? autoremove_wake_function+0x0/0x63
[<
ffffffff8107a6d4>] ? worker_thread+0x0/0x10f
[<
ffffffff8108042e>] kthread+0x91/0x99
[<
ffffffff8100ce1a>] child_rip+0xa/0x20
[<
ffffffff8100c754>] ? restore_args+0x0/0x30
[<
ffffffff8108039d>] ? kthread+0x0/0x99
[<
ffffffff8100ce10>] ? child_rip+0x0/0x20
Code: 0f 85 99 00 00 00 9c 58 66 66 90 66 90 49 89 c4 fa 66 66 90 66 66 90
e8 83 34 fb ff e8 d7 e9 26 00 48 98 49 8b 94 c6 10 01 00 00 <48> 8b 1a 44
8b 7a 18 48 85 db 74 0f 8b 42 14 48 8b 04 c3 ff 42
RIP [<
ffffffff81122537>] kmem_cache_alloc+0x9a/0x185
RSP <
ffff88003f9258b0>
CR2:
0000000000000000
---[ end trace
42f41a982344e606 ]---
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
David Howells [Wed, 16 Sep 2009 14:54:14 +0000 (15:54 +0100)]
KEYS: Have the garbage collector set its timer for live expired keys
The key garbage collector sets a timer to start a new collection cycle at the
point the earliest key to expire should be considered garbage. However, it
currently only does this if the key it is considering hasn't yet expired.
If the key being considering has expired, but hasn't yet reached the collection
time then it is ignored, and won't be collected until some other key provokes a
round of collection.
Make the garbage collector set the timer for the earliest key that hasn't yet
passed its collection time, rather than the earliest key that hasn't yet
expired.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>