Josef Bacik [Sun, 17 Jul 2011 00:44:56 +0000 (20:44 -0400)]
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push the taking of i_mutex
and the call to filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems, like ext3 and ocfs2, can apparently drop taking i_mutex
altogether. For correctness' sake I just pushed everything down in all cases so
that the current behavior stays the same for everybody; each individual fs
maintainer can then make up their mind about what to do from there.
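As an illustration only (not the patch itself), a converted handler ends up
looking roughly like this, assuming the ->fsync() prototype that also carries
the byte range; foofs_fsync() and foofs_sync_inode() are made-up names:
  static int foofs_fsync(struct file *file, loff_t start, loff_t end,
                         int datasync)
  {
          struct inode *inode = file->f_mapping->host;
          int ret;

          /* previously done by the VFS caller */
          ret = filemap_write_and_wait_range(file->f_mapping, start, end);
          if (ret)
                  return ret;

          mutex_lock(&inode->i_mutex);
          ret = foofs_sync_inode(inode, datasync);  /* hypothetical fs work */
          mutex_unlock(&inode->i_mutex);
          return ret;
  }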
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Mon, 18 Jul 2011 17:21:39 +0000 (13:21 -0400)]
drivers: fix up various ->llseek() implementations
Fix up a few ->llseek() implementations that won't deal with SEEK_HOLE/SEEK_DATA
properly. Make them future-proof so that if we ever add new options they will
return -EINVAL. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Mon, 18 Jul 2011 17:21:38 +0000 (13:21 -0400)]
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the proper due diligence is done. For example, in
NFS/CIFS we need to make sure the file size is updated properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Mon, 18 Jul 2011 17:21:37 +0000 (13:21 -0400)]
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Since Ext4 has its own lseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case, somebody else can come along and make it do fancy things later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Mon, 18 Jul 2011 17:21:36 +0000 (13:21 -0400)]
Btrfs: implement our own ->llseek
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Mon, 18 Jul 2011 17:21:35 +0000 (13:21 -0400)]
fs: add SEEK_HOLE and SEEK_DATA flags
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. It turns
out using fiemap in things like cp causes more problems than it solves, so let's
try and give userspace an interface that doesn't suck. We need to match Solaris
here, and the definitions are
  o  If whence is SEEK_HOLE, the offset of the start of the
     next hole greater than or equal to the supplied offset
     is returned. The definition of a hole is provided near
     the end of the DESCRIPTION.

  o  If whence is SEEK_DATA, the file pointer is set to the
     start of the next non-hole file region greater than or
     equal to the supplied offset.
So in the generic case the entire file is data and there is a virtual hole at
the end. That means we will just return i_size for SEEK_HOLE and will return
the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
the same way.
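Purely as a sketch (not the actual helper, and with the EOF error handling
assumed rather than quoted), the generic fallback amounts to:
  /* the whole file is data, with one virtual hole at i_size */
  static loff_t seek_hole_data_generic(struct inode *inode, loff_t offset,
                                       int whence)
  {
          if (offset >= i_size_read(inode))
                  return -ENXIO;                 /* assumed EOF behaviour */
          if (whence == SEEK_HOLE)
                  return i_size_read(inode);     /* the virtual hole */
          return offset;                         /* SEEK_DATA: already data */
  }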
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Sat, 16 Jul 2011 20:47:00 +0000 (16:47 -0400)]
reiserfs: make reiserfs default to barrier=flush
Change the default reiserfs mount option to barrier=flush. Based on a patch
from Jeff Mahoney in the SuSE tree.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Sat, 16 Jul 2011 20:46:50 +0000 (16:46 -0400)]
ext3: make ext3 mount default to barrier=1
This patch turns on barriers by default for ext3. mount -o barrier=0
will turn them off. Based on a patch from Chris Mason in the SuSE tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 17 Jul 2011 15:19:44 +0000 (11:19 -0400)]
don't open-code parent_ino() in assorted ->readdir()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 16 Jul 2011 21:43:09 +0000 (17:43 -0400)]
minix_getattr(): don't bother with ->d_parent
we can find the superblock more easily, TYVM...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 16 Jul 2011 21:06:30 +0000 (17:06 -0400)]
coda_venus_readdir(): use offsetof()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 16 Jul 2011 16:41:29 +0000 (12:41 -0400)]
arm: don't create useless copies to pass into debugfs_create_dir()
its first argument is const char * and it's really not modified...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 16 Jul 2011 16:37:57 +0000 (12:37 -0400)]
switch assorted clock drivers to debugfs_remove_recursive()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kay Sievers [Tue, 12 Jul 2011 18:48:39 +0000 (20:48 +0200)]
fs: seq_file - add event counter to simplify poll() support
Moving the event counter into the dynamically allocated 'struct seq_file'
allows poll() support without each user needing to allocate its own tracking
structure.
All current users are switched over to use the new counter.
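For illustration, a seq_file-based ->poll() can now compare the counter kept
in the seq_file against the subsystem's own event count; foo_wait and
foo_event below are placeholders, and the field name is assumed to match the
description above:
  static unsigned int foo_poll(struct file *file, poll_table *wait)
  {
          struct seq_file *m = file->private_data;
          unsigned int res = POLLIN | POLLRDNORM;

          poll_wait(file, &foo_wait, wait);
          if (m->poll_event != atomic_read(&foo_event)) {
                  m->poll_event = atomic_read(&foo_event);
                  res |= POLLERR | POLLPRI;      /* contents have changed */
          }
          return res;
  }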
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:48 +0000 (14:29 -0400)]
fs: move inode_dio_done to the end_io handler
For filesystems that delay their end_io processing we should keep our
i_dio_count until the processing is done. Enable this by moving
the inode_dio_done call to the end_io handler if one exists. Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers. At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:47 +0000 (14:29 -0400)]
fs: simplify the blockdev_direct_IO prototype
Simple filesystems always pass inode->i_sb->s_bdev as the block device
argument, and never need an end_io handler. Let's simplify things for
them and for my grepping activity by dropping these arguments. The
only thing not falling into that scheme is ext4, which passes an
end_io handler without needing special flags (yet), but given how
messy the direct I/O code there is, the use of __blockdev_direct_IO
in one instead of two out of three cases isn't going to make a large
difference anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:46 +0000 (14:29 -0400)]
fs: always maintain i_dio_count
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This allows these filesystems to also protect truncate against direct I/O
requests by using common code. Right now the only non-DIO_LOCKING filesystem
that appears to do so is XFS, which uses an opencoded variant of the
i_dio_count scheme.
Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 the handcrafted i_dio_count manipulations are replaced with
the common code now enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:45 +0000 (14:29 -0400)]
fs: move inode_dio_wait calls into ->setattr
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio references from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:44 +0000 (14:29 -0400)]
rw_semaphore: remove up/down_read_non_owner
Now that the last user is gone these can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:43 +0000 (14:29 -0400)]
fs: kill i_alloc_sem
i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and its write side is always mirrored by
real exclusion. Its intended use is to wait for all pending direct I/O
requests to finish before starting a truncate.
Replace it with a hand-grown construct:
- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall away
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to reach zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.
This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
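Roughly, and with field/bit names used here for illustration rather than
quoted from the patch, the two sides of the scheme look like:
  /* direct I/O side: one count per in-flight request */
  static void dio_begin(struct inode *inode)
  {
          atomic_inc(&inode->i_dio_count);
  }

  static void dio_done(struct inode *inode)
  {
          if (atomic_dec_and_test(&inode->i_dio_count))
                  wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
  }

  /* truncate side (sketch): with i_mutex held, so no new requests can
   * start, sleep on that same bit until i_dio_count drains to zero */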
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:42 +0000 (14:29 -0400)]
fs: simplify handling of zero sized reads in __blockdev_direct_IO
Reject zero sized reads as soon as we know our I/O length, and don't
bother with locks or allocations that might have to be cleaned up
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Fri, 24 Jun 2011 18:29:41 +0000 (14:29 -0400)]
ext4: Rewrite ext4_page_mkwrite() to use generic helpers
Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.
Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christoph Hellwig [Fri, 24 Jun 2011 18:29:40 +0000 (14:29 -0400)]
fat: remove i_alloc_sem abuse
Add a new rw_semaphore to protect bmap against truncate. Previously
i_alloc_sem was abused for this, but it's going away in this series.
Note that we can't simply use i_mutex, given that the swapon code
calls ->bmap under it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tobias Klauser [Fri, 1 Jul 2011 11:44:51 +0000 (13:44 +0200)]
VFS: Fixup kerneldoc for generic_permission()
The flags parameter went away in
d749519b444db985e40b897f73ce1898b11f997e
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tomasz Stanislawski [Tue, 12 Jul 2011 09:27:20 +0000 (11:27 +0200)]
anonfd: fix missing declaration
The forward declaration of struct file_operations is
added to avoid compilation warnings.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:46 +0000 (14:14 +1000)]
xfs: make use of new shrinker callout for the inode cache
Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:45 +0000 (14:14 +1000)]
vfs: increase shrinker batch size
Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep economies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.
To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.
To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:44 +0000 (14:14 +1000)]
superblock: add filesystem shrinker operations
Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.
Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many freeable objects are in the
cache"), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
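A sketch of how a filesystem might wire these up (operation and helper names
here are assumptions for illustration, not quotes from the patch):
  static int foofs_nr_cached_objects(struct super_block *sb)
  {
          /* cheap count of objects the fs could free right now */
          return foofs_count_reclaimable(sb);
  }

  static void foofs_free_cached_objects(struct super_block *sb, int nr)
  {
          /* actually free up to nr of them */
          foofs_reclaim(sb, nr);
  }

  static const struct super_operations foofs_sops = {
          /* ... usual operations omitted ... */
          .nr_cached_objects   = foofs_nr_cached_objects,
          .free_cached_objects = foofs_free_cached_objects,
  };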
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:43 +0000 (14:14 +1000)]
inode: remove iprune_sem
Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is no longer any race condition for the
iprune_sem to protect against. Hence we can remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:42 +0000 (14:14 +1000)]
superblock: introduce per-sb cache shrinker infrastructure
With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.
With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code, because the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:41 +0000 (14:14 +1000)]
superblock: move pin_sb_for_writeback() to fs/super.c
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.
pin_sb_for_writeback() does this exactly, so move it to fs/super.c,
rename it to grab_super_passive() and export it via
fs/internal.h so that all the VFS code is able to use it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:40 +0000 (14:14 +1000)]
inode: move to per-sb LRU locks
With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:39 +0000 (14:14 +1000)]
inode: Make unused inode LRU per superblock
The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.
To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.
The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:38 +0000 (14:14 +1000)]
inode: convert inode_stat.nr_unused to per-cpu counters
Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:37 +0000 (14:14 +1000)]
vmscan: add customisable shrinker batch size
For shrinkers that have their own cond_resched* calls, having
shrink_slab break the work down into small batches is not
particularly efficient. Add a custom batch size field to the struct
shrinker so that shrinkers can use a larger batch size if they
desire.
A value of zero (uninitialised) means "use the default", so
behaviour is unchanged by this patch.
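Illustratively, a shrinker that wants bigger batches just fills in the new
field (callback name and value made up here):
  static struct shrinker foo_shrinker = {
          .shrink = foo_shrink_cache,    /* does its own cond_resched() */
          .seeks  = DEFAULT_SEEKS,
          .batch  = 1024,                /* 0 keeps the old default */
  };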
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:36 +0000 (14:14 +1000)]
vmscan: reduce wind up shrinker->nr when shrinker can't do work
When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.
However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.
This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.
Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so for as long as the workload continues.
This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.
To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.
If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
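In rough pseudo-C (variable names approximate, not the exact patch), the
heuristic is:
  total_scan = shrinker->nr + delta;
  if (delta < max_pass / 4)                      /* not a "large scan" */
          total_scan = min(total_scan, max_pass / 2);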
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:35 +0000 (14:14 +1000)]
vmscan: shrinker->nr updates race and go wrong
shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.
As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!
Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
other updates via cmpxchg loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:34 +0000 (14:14 +1000)]
vmscan: add shrink_slab tracepoints
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 9 Jul 2011 01:20:11 +0000 (21:20 -0400)]
make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
... and simplify the living hell out of callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 9 Jul 2011 00:57:47 +0000 (20:57 -0400)]
deuglify squashfs_lookup()
d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL) followed
by returning NULL, so no need for that if (inode) ... in there (or
ERR_PTR(0), for that matter)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 22:43:21 +0000 (18:43 -0400)]
nfsd4_list_rec_dir(): don't bother with reopening rec_file
just rewind it to the beginning before vfs_readdir() and be
done with that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 19:45:59 +0000 (15:45 -0400)]
kill useless checks for sb->s_op == NULL
never is...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 19:44:25 +0000 (15:44 -0400)]
btrfs: kill magical embedded struct superblock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 19:12:51 +0000 (15:12 -0400)]
get rid of pointless checks for dentry->sb == NULL
it never is...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 19:03:58 +0000 (15:03 -0400)]
Make ->d_sb assign-once and always non-NULL
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for the parent != NULL case; it uses __d_alloc() and
inserts the result into the list of the parent's children.
Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 21:14:56 +0000 (17:14 -0400)]
unexport kern_path_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 21:00:37 +0000 (17:00 -0400)]
switch vfs_path_lookup() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 20:53:43 +0000 (16:53 -0400)]
kill lookup_create()
folded into the only caller (kern_path_create())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 20:37:12 +0000 (16:37 -0400)]
devtmpfs: get rid of bogus mkdir in create_path()
We do _NOT_ want to mkdir the path itself - we are preparing to
mknod it, after all. Normally it'll fail with -ENOENT and
just do nothing, but if somebody has created the parent in
the meanwhile, we'll get buggered...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 20:35:45 +0000 (16:35 -0400)]
switch devtmpfs to kern_path_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 27 Jun 2011 20:25:29 +0000 (16:25 -0400)]
switch devtmpfs object creation/removal to separate kernel thread
... and give it a namespace where devtmpfs would be mounted on root,
thus avoiding abuses of vfs_path_lookup() (it was never intended to
be used with LOOKUP_PARENT). Games with credentials are also gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 28 Jun 2011 19:41:10 +0000 (15:41 -0400)]
make sure that nsproxy_cache is initialized early enough
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 15:54:58 +0000 (11:54 -0400)]
switch do_spufs_create() to user_path_create(), fix double-unlock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 15:50:15 +0000 (11:50 -0400)]
new helpers: kern_path_create/user_path_create
combination of kern_path_parent() and lookup_create(). Does *not*
expose struct nameidata to caller. Syscalls converted to that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:59:52 +0000 (21:59 -0400)]
kill LOOKUP_CONTINUE
LOOKUP_PARENT is equivalent to it now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:48:43 +0000 (21:48 -0400)]
nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last step
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:45:21 +0000 (21:45 -0400)]
cifs_lookup(): LOOKUP_OPEN is set only on the last component
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:43:56 +0000 (21:43 -0400)]
ceph: LOOKUP_OPEN is set only when it's the last component
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:41:09 +0000 (21:41 -0400)]
jfs_ci_revalidate() is safe from RCU mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:37:18 +0000 (21:37 -0400)]
LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last step
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:17:17 +0000 (21:17 -0400)]
no need to check for LOOKUP_OPEN in ->create() instances
... it will be set in nd->flags for all cases with non-NULL nd
(i.e. when called from do_last()).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 26 Jun 2011 01:08:31 +0000 (21:08 -0400)]
don't pass nameidata to vfs_create() from ecryptfs_create()
Instead of playing with removal of LOOKUP_OPEN, mangling (and
restoring) nd->path, just pass NULL to vfs_create(). The whole
point of what's being done there is to suppress any attempts
to open file by underlying fs, which is what nd == NULL indicates.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 25 Jun 2011 23:15:54 +0000 (19:15 -0400)]
don't transliterate lower bits of ->intent.open.flags to FMODE_...
->create() instances are much happier that way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 23 Jun 2011 16:35:50 +0000 (12:35 -0400)]
Don't pass nameidata when calling vfs_create() from mknod()
All instances can cope with that now (and ceph one actually
starts working properly).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 22 Jun 2011 22:53:18 +0000 (18:53 -0400)]
fix mknod() on nfs4 (hopefully)
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE)
b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0
c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN;
otherwise we need to issue CLOSE, lest we leak stateid on server.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 22 Jun 2011 22:47:28 +0000 (18:47 -0400)]
nameidata_to_nfs_open_context() doesn't need nameidata, actually...
just open flags; switched to passing just those and
renamed to create_nfs_open_context()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 22 Jun 2011 22:40:12 +0000 (18:40 -0400)]
nfs_open_context doesn't need struct path either
just dentry, please...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 22 Jun 2011 22:30:55 +0000 (18:30 -0400)]
nfs4_opendata doesn't need struct path either
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 22 Jun 2011 22:20:23 +0000 (18:20 -0400)]
nfs4_closedata doesn't need to mess with struct path
instead of path_get()/path_put(), we can just use nfs_sb_{,de}active()
to pin the superblock down.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 21 Jun 2011 12:51:28 +0000 (08:51 -0400)]
cifs: fix the type of cifs_demultiplex_thread()
... and get rid of a bogus typecast, while we are at it; it's not
just that we want a function returning int and not void, but cast
to pointer to function taking void * and returning void would be
(void (*)(void *)) and not (void *)(void *), TYVM...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 21 Jun 2011 05:01:59 +0000 (01:01 -0400)]
ecryptfs_inode_permission() doesn't need to bail out on RCU
... now that inode_permission() can take MAY_NOT_BLOCK and handle it
properly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 21 Jun 2011 05:01:22 +0000 (01:01 -0400)]
kill IPERM_FLAG_RCU
not used anymore
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 21 Jun 2011 01:56:31 +0000 (21:56 -0400)]
->permission() sanitizing: document API changes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 14:55:26 +0000 (10:55 -0400)]
merge do_revalidate() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:57:03 +0000 (19:57 -0400)]
no reason to keep exec_permission() separate now
cache footprint alone makes it a bad idea...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:55:42 +0000 (19:55 -0400)]
massage generic_permission() to treat directories on a separate path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:48:41 +0000 (19:48 -0400)]
->permission() sanitizing: don't pass flags to exec_permission()
pass mask instead; kill security_inode_exec_permission() since we can use
security_inode_permission() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:44:08 +0000 (19:44 -0400)]
selinux: don't transliterate MAY_NOT_BLOCK to IPERM_FLAG_RCU
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:38:15 +0000 (19:38 -0400)]
->permission() sanitizing: don't pass flags to ->inode_permission()
pass that via mask instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:28:19 +0000 (19:28 -0400)]
->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:16:29 +0000 (19:16 -0400)]
->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:12:17 +0000 (19:12 -0400)]
->permission() sanitizing: don't pass flags to ->check_acl()
not used in the instances anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 23:06:22 +0000 (19:06 -0400)]
->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 22:59:02 +0000 (18:59 -0400)]
->permission() sanitizing: MAY_NOT_BLOCK
Duplicate the flags argument into mask bitmap.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 15:31:30 +0000 (11:31 -0400)]
kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Mon, 20 Jun 2011 14:52:57 +0000 (10:52 -0400)]
lockless get_write_access/deny_write_access
new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive()
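A sketch of the idea behind the first helper (illustrative, not necessarily
the exact implementation): only bump the counter if it hasn't gone negative,
which is exactly the meaning get_write_access() gives to i_writecount, so the
write-access path can fail with -ETXTBSY whenever this returns 0.
  static inline int inc_unless_negative(atomic_t *p)  /* illustrative name */
  {
          int v, old;

          for (v = atomic_read(p); v >= 0; v = old) {
                  old = atomic_cmpxchg(p, v, v + 1);
                  if (old == v)
                          return 1;              /* incremented */
          }
          return 0;                              /* was negative: denied */
  }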
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 17:14:21 +0000 (13:14 -0400)]
move exec_permission() up to the rest of permission-related functions
... and convert the comment before it into linuxdoc form.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 16:55:10 +0000 (12:55 -0400)]
kill file_permission() completely
convert the last remaining caller to inode_permission()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 16:49:47 +0000 (12:49 -0400)]
consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling
new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and, if we are not, sets ENFORCE_NONDUMP. Exported,
used in places that previously open-coded the same logic.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 15:54:42 +0000 (11:54 -0400)]
switch path_init() to exec_permission()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 15:49:08 +0000 (11:49 -0400)]
switch udf_ioctl() to inode_permission()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 19 Jun 2011 05:50:08 +0000 (01:50 -0400)]
make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC)
capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.
Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to the caller, capabilities or no capabilities.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 4 Jun 2011 00:16:57 +0000 (20:16 -0400)]
new helper: iterate_supers_type()
Call the given function for all superblocks of given type. Function
gets a superblock (with s_umount locked shared) and (void *) argument
supplied by caller of iterator.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Josef Bacik [Tue, 31 May 2011 15:58:49 +0000 (11:58 -0400)]
fs: add a DCACHE_NEED_LOOKUP flag for d_flags
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done; it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this
http://people.redhat.com/jwhiter/ls-noreada.png
To this
http://people.redhat.com/jwhiter/ls-good.png
That's a savings of 1300 seconds, or 22 minutes. That is a significant savings.
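For illustration only (helper names are guesses, not the patch itself), the
lookup path gains a check along these lines:
  dentry = d_lookup(parent, name);
  if (dentry && d_need_lookup(dentry)) {
          /* readdir created this dentry, but ->lookup() has never run
           * on it, so the inode still has to be attached */
          dentry = do_real_lookup(parent, dentry, nd);   /* hypothetical */
  }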
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Wed, 20 Jul 2011 05:10:28 +0000 (22:10 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix file mode calculation
Linus Torvalds [Wed, 20 Jul 2011 05:10:05 +0000 (22:10 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc:
davinci: DM365 EVM: fix video input mux bits
ARM: davinci: Check for NULL return from irq_alloc_generic_chip
arm: davinci: Fix low level gpio irq handlers' argument
Shaohua Li [Tue, 19 Jul 2011 15:49:26 +0000 (08:49 -0700)]
vmscan: fix a livelock in kswapd
I'm running a workload which triggers a lot of swap in a machine with 4
nodes. After I killed the workload, I found a kswapd livelock. Sometimes
kswapd3 or kswapd2 keep running and I can't access the filesystem,
but most memory is free.
This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
correct check for kswapd sleeping in sleeping_prematurely").
Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
for classzone_idx. The reason is end_zone in balance_pgdat() is 0 by
default, if all zones have watermark ok, end_zone will keep 0.
Later sleeping_prematurely() always returns true. Because this is an
order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0. We add a special case here.
If a zone has no page, we think it's balanced. This fixes the livelock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Tue, 19 Jul 2011 15:49:25 +0000 (08:49 -0700)]
fs/libfs.c: fix simple_attr_write() on 32bit machines
Assume that /sys/kernel/debug/dummy64 is debugfs file created by
debugfs_create_x64().
# cd /sys/kernel/debug
# echo 0x1234567812345678 > dummy64
# cat dummy64
0x0000000012345678
# echo 0x80000000 > dummy64
# cat dummy64
0xffffffff80000000
A value larger than INT_MAX cannot be written to the debugfs file created
by debugfs_create_u64 or debugfs_create_x64 on a 32-bit machine, because
simple_attr_write() uses simple_strtol() for the conversion.
To fix this, use simple_strtoll() instead.
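The change amounts to roughly this one-liner in simple_attr_write() (buffer
name recalled from memory, so treat it as approximate):
  -       val = simple_strtol(attr->set_buf, NULL, 0);
  +       val = simple_strtoll(attr->set_buf, NULL, 0);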
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 20 Jul 2011 04:50:21 +0000 (21:50 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: fix race in rcu lookup of pruned dentry
Fix cifs_get_root()
[ Edited the last commit to get rid of a 'unused variable "seq"'
warning due to Al editing the patch. - Linus ]
Linus Torvalds [Mon, 18 Jul 2011 22:43:29 +0000 (15:43 -0700)]
vfs: fix race in rcu lookup of pruned dentry
Don't update *inode in __follow_mount_rcu() until we've verified that
there is a mountpoint there. Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>