Dave Chinner [Mon, 11 Feb 2013 04:58:13 +0000 (15:58 +1100)]
xfs: xfs_bmap_add_attrfork_local is too generic
When we are converting local data to an extent format as a result of
adding an attribute, the type of data contained in the local fork
determines the behaviour that needs to occur.
xfs_bmap_add_attrfork_local() already handles the directory data
case specially by using S_ISDIR() and calling out to
xfs_dir2_sf_to_block(), but with verifiers we now need to handle
each different type of metadata specially and different metadata
formats require different verifiers (and eventually block header
initialisation).
There is only a single place that we add and attribute fork to
the inode, but that is in the attribute code and it knows nothing
about the specific contents of the data fork. It is only the case of
local data that is the issue here, so adding code to hadnle this
case in the attribute specific code is wrong. Hence we are really
stuck trying to detect the data fork contents in
xfs_bmap_add_attrfork_local() and performing the correct callout
there.
Luckily the current cases can be determined by S_IS* macros, and we
can push the work off to data specific callouts, but each of those
callouts does a lot of work in common with
xfs_bmap_local_to_extents(). The only reason that this fails for
symlinks right now is is that xfs_bmap_local_to_extents() assumes
the data fork contains extent data, and so attaches a a bmap extent
data verifier to the buffer and simply copies the data fork
information straight into it.
To fix this, allow us to pass a "formatting" callback into
xfs_bmap_local_to_extents() which is responsible for setting the
buffer type, initialising it and copying the data fork contents over
to the new buffer. This allows callers to specify how they want to
format the new buffer (which is necessary for the upcoming CRC
enabled metadata blocks) and hence make xfs_bmap_local_to_extents()
useful for any type of data fork content.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Brian Foster [Mon, 11 Feb 2013 15:08:22 +0000 (10:08 -0500)]
xfs: remove log force from xfs_buf_trylock()
The trylock log force invoked via xfs_buf_item_push() can attempt
to acquire xa_lock, thus leading to a recursion bug when called
with xa_lock held.
This log force was originally added to xfs_buf_trylock() to address
xfsaild stalls due to pinned and stale buffers. Since the addition
of this behavior, the log item pushing code had been reworked to
detect and track pinned items to inform xfsaild to issue a log
force itself when necessary. As such, the log force on trylock
failure is redundant and safe to remove.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Brian Foster [Mon, 11 Feb 2013 15:08:21 +0000 (10:08 -0500)]
xfs: recheck buffer pinned status after push trylock failure
The buffer pinned check and trylock sequence in xfs_buf_item_push()
can race with an active transaction on marking the buffer pinned.
This can result in the buffer becoming pinned and stale after the
initial check and the trylock failure, but before the check in
xfs_buf_trylock() that issues a log force. If the log force is
issued from this context, a spinlock recursion occurs on xa_lock.
Prepare xfs_buf_item_push() to handle the race by detecting a
pinned buffer after the trylock failure so xfsaild issues a log
force from a safe context. This, along with various previous fixes,
renders the log force in xfs_buf_trylock() redundant.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Chinner [Mon, 11 Feb 2013 05:05:01 +0000 (16:05 +1100)]
xfs: limit speculative prealloc size on sparse files
Speculative preallocation based on the current file size works well
for contiguous files, but is sub-optimal for sparse files where the
EOF preallocation can fill holes and result in large amounts of
zeros being written when it is not necessary.
The algorithm is modified to prevent EOF speculative preallocation
from triggering larger allocations on IO patterns of
truncate--to-zero-seek-write-seek-write-.... which results in
non-sparse files for large files. This, unfortunately, is the way cp
now behaves when copying sparse files and so needs to be fixed.
What this code does is that it looks at the existing extent adjacent
to the current EOF and if it determines that it is a hole we disable
speculative preallocation altogether. To avoid the next write from
doing a large prealloc, it takes the size of subsequent
preallocations from the current size of the existing EOF extent.
IOWs, if you leave a hole in the file, it resets preallocation
behaviour to the same as if it was a zero size file.
Example new behaviour:
$ xfs_io -f -c "pwrite 0 31m" \
-c "pwrite 33m 1m" \
-c "pwrite 128m 1m" \
-c "fiemap -v" /mnt/scratch/blah
wrote
32505856/
32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec)
wrote 1048576/1048576 bytes at offset
34603008
1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec)
wrote 1048576/1048576 bytes at offset
134217728
1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec)
/mnt/scratch/blah:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..65535]: 96..65631 65536 0x0
1: [65536..67583]: hole 2048
2: [67584..69631]: 67680..69727 2048 0x0
3: [69632..262143]: hole 192512
4: [262144..264191]: 262240..264287 2048 0x1
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Alex Elder [Mon, 4 Feb 2013 16:13:11 +0000 (10:13 -0600)]
xfs: memory barrier before wake_up_bit()
In xfs_ifunlock() there is a call to wake_up_bit() after clearing
the flush lock on the xfs inode. This is not guaranteed to be safe,
as noted in the comments above wake_up_bit() beginning with:
In order for this to function properly, as it uses
waitqueue_active() internally, some kind of memory
barrier must be done prior to calling this.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:53 +0000 (21:27 +0800)]
xfs: refactor space log reservation for XFS_TRANS_ATTR_SET
Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:
1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.
2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.
This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.
In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:46 +0000 (21:27 +0800)]
xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy()
Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:39 +0000 (21:27 +0800)]
xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb()
Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:31 +0000 (21:27 +0800)]
xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount()
Make use of XFS_SB_LOG_RES() at xfs_log_sbcount().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:25 +0000 (21:27 +0800)]
xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:21 +0000 (21:27 +0800)]
xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:15 +0000 (21:27 +0800)]
xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:27:04 +0000 (21:27 +0800)]
xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:26:49 +0000 (21:26 +0800)]
xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:26:34 +0000 (21:26 +0800)]
xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:26:16 +0000 (21:26 +0800)]
xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Fri, 1 Feb 2013 20:39:29 +0000 (14:39 -0600)]
xfs: make use of xfs_calc_buf_res() in xfs_trans.c
Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jeff Liu [Mon, 28 Jan 2013 13:25:35 +0000 (21:25 +0800)]
xfs: add a helper to figure out the space log reservation per item
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item. xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Torsten Kaiser [Sun, 20 Jan 2013 09:24:49 +0000 (10:24 +0100)]
xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
Commit
fb59581404ab7ec5075299065c22cb211a9262a9 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Jan Kara [Wed, 23 Jan 2013 12:56:18 +0000 (13:56 +0100)]
xfs: Fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Chinner [Mon, 21 Jan 2013 12:53:55 +0000 (23:53 +1100)]
xfs: fix shutdown hang on invalid inode during create
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:
[ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:
xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.
The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.
The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Chinner [Mon, 21 Jan 2013 12:53:54 +0000 (23:53 +1100)]
xfs: limit speculative prealloc near ENOSPC thresholds
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.
Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Dave Chinner [Mon, 21 Jan 2013 12:53:52 +0000 (23:53 +1100)]
xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.
In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported. Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Ben Myers [Fri, 18 Jan 2013 20:17:46 +0000 (14:17 -0600)]
xfs: fix fs/xfs/xfs_log.c:1740:39: error: 'B_TRUE' undeclared
Commit
667a9291c5b3 "xfs: Remove boolean_t typedef completely." didn't.
Remove a stray B_TRUE that breaks CONFIG_XFS_DEBUG=y.
Signed-off-by: Ben Myers <bpm@sgi.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Brian Foster [Thu, 17 Jan 2013 18:11:29 +0000 (13:11 -0500)]
xfs: pull up stack_switch check into xfs_bmapi_write
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Thiago Farina [Mon, 12 Nov 2012 23:32:59 +0000 (21:32 -0200)]
xfs: Remove boolean_t typedef completely.
Since we are using C99 we have one builtin defined in include/linux/types.h,
use that instead.
v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm
Signed-off-by: Thiago Farina <tfarina@chromium.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Eric Sandeen [Wed, 16 Jan 2013 23:33:53 +0000 (17:33 -0600)]
xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Abhijit Pawar [Wed, 9 Jan 2013 14:04:42 +0000 (19:34 +0530)]
fs/xfs remove obsolete simple_strto<foo>
This patch replaces usages of obsolete simple_strtoul with kstrtoint in
xfs_args and suffix_strtoul.
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Eric Sandeen [Thu, 10 Jan 2013 16:41:48 +0000 (10:41 -0600)]
xfs: recalculate leaf entry pointer after compacting a dir2 block
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:
XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828
Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.
memmove(&blp[mid + 1], &blp[mid],
(highstale - mid) * sizeof(*blp));
overwrote it.
What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array. So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().
Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.
Thanks also to dchinner for looking at this one with me.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Eric Sandeen [Mon, 10 Dec 2012 20:49:15 +0000 (14:49 -0600)]
xfs: don't zero structure members after a memset(0)
Commit
408cc4e97a3ccd172d2d676e4b585badf439271b
added memset(0, ...) to allocation args structures,
so there is no need to explicitly set any of the fields
to 0 after that.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Brian Foster [Fri, 21 Dec 2012 15:45:17 +0000 (10:45 -0500)]
xfs: remove int casts from debug dquot soft limit timer asserts
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Ben Myers [Mon, 31 Dec 2012 16:41:39 +0000 (10:41 -0600)]
Merge branch 'xfs-for-3.9'
Linus Torvalds [Sat, 22 Dec 2012 01:19:00 +0000 (17:19 -0800)]
Linux 3.8-rc1
Linus Torvalds [Sat, 22 Dec 2012 01:10:29 +0000 (17:10 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
"This includes some fixes and code improvements (like
clk_prepare_enable and clk_disable_unprepare), conversion from the
omap_wdt and twl4030_wdt drivers to the watchdog framework, addition
of the SB8x0 chipset support and the DA9055 Watchdog driver and some
OF support for the davinci_wdt driver."
* git://www.linux-watchdog.org/linux-watchdog: (22 commits)
watchdog: mei: avoid oops in watchdog unregister code path
watchdog: Orion: Fix possible null-deference in orion_wdt_probe
watchdog: sp5100_tco: Add SB8x0 chipset support
watchdog: davinci_wdt: add OF support
watchdog: da9052: Fix invalid free of devm_ allocated data
watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER
watchdog: remove depends on CONFIG_EXPERIMENTAL
watchdog: Convert dev_printk(KERN_<LEVEL> to dev_<level>(
watchdog: DA9055 Watchdog driver
watchdog: omap_wdt: eliminate goto
watchdog: omap_wdt: delete redundant platform_set_drvdata() calls
watchdog: omap_wdt: convert to devm_ functions
watchdog: omap_wdt: convert to new watchdog core
watchdog: WatchDog Timer Driver Core: fix comment
watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare
watchdog: imx2_wdt: Select the driver via ARCH_MXC
watchdog: cpu5wdt.c: add missing del_timer call
watchdog: hpwdt.c: Increase version string
watchdog: Convert twl4030_wdt to watchdog core
davinci_wdt: preparation for switch to common clock framework
...
Linus Torvalds [Sat, 22 Dec 2012 01:09:07 +0000 (17:09 -0800)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"Misc small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: eliminate cifsERROR variable
cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use
cifs: fix double-free of "string" in cifs_parse_mount_options
Linus Torvalds [Sat, 22 Dec 2012 01:08:06 +0000 (17:08 -0800)]
Merge tag 'dm-3.8-fixes' of git://git./linux/kernel/git/agk/linux-dm
Pull dm update from Alasdair G Kergon:
"Miscellaneous device-mapper fixes, cleanups and performance
improvements.
Of particular note:
- Disable broken WRITE SAME support in all targets except linear and
striped. Use it when kcopyd is zeroing blocks.
- Remove several mempools from targets by moving the data into the
bio's new front_pad area(which dm calls 'per_bio_data').
- Fix a race in thin provisioning if discards are misused.
- Prevent userspace from interfering with the ioctl parameters and
use kmalloc for the data buffer if it's small instead of vmalloc.
- Throttle some annoying error messages when I/O fails."
* tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits)
dm stripe: add WRITE SAME support
dm: remove map_info
dm snapshot: do not use map_context
dm thin: dont use map_context
dm raid1: dont use map_context
dm flakey: dont use map_context
dm raid1: rename read_record to bio_record
dm: move target request nr to dm_target_io
dm snapshot: use per_bio_data
dm verity: use per_bio_data
dm raid1: use per_bio_data
dm: introduce per_bio_data
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
dm linear: add WRITE SAME support
dm: add WRITE SAME support
dm: prepare to support WRITE SAME
dm ioctl: use kmalloc if possible
dm ioctl: remove PF_MEMALLOC
dm persistent data: improve improve space map block alloc failure message
dm thin: use DMERR_LIMIT for errors
...
J. Bruce Fields [Sat, 22 Dec 2012 00:48:59 +0000 (19:48 -0500)]
Revert "nfsd: warn on odd reply state in nfsd_vfs_read"
This reverts commit
79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e.
This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs. The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 22 Dec 2012 00:40:26 +0000 (16:40 -0800)]
Merge tag 'rdma-for-linus' of git://git./linux/kernel/git/roland/infiniband
Pull more infiniband changes from Roland Dreier:
"Second batch of InfiniBand/RDMA changes for 3.8:
- cxgb4 changes to fix lookup engine hash collisions
- mlx4 changes to make flow steering usable
- fix to IPoIB to avoid pinning dst reference for too long"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cxgb4: Fix bug for active and passive LE hash collision path
RDMA/cxgb4: Fix LE hash collision bug for passive open connection
RDMA/cxgb4: Fix LE hash collision bug for active open connection
mlx4_core: Allow choosing flow steering mode
mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV
mlx4_core: Fix error flow in the flow steering wrapper
mlx4_core: Add QPN enforcement for flow steering rules set by VFs
cxgb4: Add LE hash collision bug fix path in LLD driver
cxgb4: Add T4 filter support
IPoIB: Call skb_dst_drop() once skb is enqueued for sending
Linus Torvalds [Sat, 22 Dec 2012 00:39:08 +0000 (16:39 -0800)]
Merge tag 'asm-generic' of git://git./linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanup from Arnd Bergmann:
"These are a few cleanups for asm-generic:
- a set of patches from Lars-Peter Clausen to generalize asm/mmu.h
and use it in the architectures that don't need any special
handling.
- A patch from Will Deacon to remove the {read,write}s{b,w,l} as
discussed during the arm64 review
- A patch from James Hogan that helps with the meta architecture
series."
* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
xtensa: Use generic asm/mmu.h for nommu
h8300: Use generic asm/mmu.h
c6x: Use generic asm/mmu.h
asm-generic/mmu.h: Add support for FDPIC
asm-generic/mmu.h: Remove unused vmlist field from mm_context_t
asm-generic: io: remove {read,write} string functions
asm-generic/io.h: remove asm/cacheflush.h include
Kukjin Kim [Fri, 21 Dec 2012 18:02:13 +0000 (10:02 -0800)]
ARM: dts: fix duplicated build target and alphabetical sort out for exynos
Commit
db5b0ae00712 ("Merge tag 'dt' of git://git.kernel.org/.../arm-soc")
causes a duplicated build target. This patch fixes it and sorts out the
build target alphabetically so that we can recognize something wrong
easily.
Cc: Olof Johansson <olof@lixom.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Snitzer [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm stripe: add WRITE SAME support
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE
SAME bio processing.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm: remove map_info
This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:41 +0000 (20:23 +0000)]
dm snapshot: do not use map_context
Eliminate struct map_info from dm-snap.
map_info->ptr was used in dm-snap to indicate if the bio was tracked.
If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash.
This patch removes the use of map_info->ptr. We determine if the bio was
tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true,
the bio is not tracked, if it is false, the bio is tracked.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:40 +0000 (20:23 +0000)]
dm thin: dont use map_context
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
This patch removes any use of map_info in preparation for the next patch
that removes map_info from bio-based device mapper.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:40 +0000 (20:23 +0000)]
dm raid1: dont use map_context
Don't use map_info any more in dm-raid1.
map_info was used for writes to hold the region number. For this purpose
we add a new field dm_bio_details to dm_raid1_bio_record.
map_info was used for reads to hold a pointer to dm_raid1_bio_record (if
the pointer was non-NULL, bio details were saved; if the pointer was
NULL, bio details were not saved). We use
dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is
NULL, details were not saved, if bi_bdev is non-NULL, details were
saved.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm flakey: dont use map_context
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm raid1: rename read_record to bio_record
Rename struct read_record to bio_record in dm-raid1.
In the following patch, the structure will be used for both read and
write bios, so rename it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:39 +0000 (20:23 +0000)]
dm: move target request nr to dm_target_io
This patch moves target_request_nr from map_info to dm_target_io and
makes it accessible with dm_bio_get_target_request_nr.
This patch is a preparation for the next patch that removes map_info.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm snapshot: use per_bio_data
Replace tracked_chunk_pool with per_bio_data in dm-snap.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm verity: use per_bio_data
Replace io_mempool with per_bio_data in dm-verity.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm raid1: use per_bio_data
Replace read_record_pool with per_bio_data in dm-raid1.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:38 +0000 (20:23 +0000)]
dm: introduce per_bio_data
Introduce a field per_bio_data_size in struct dm_target.
Targets can set this field in the constructor. If a target sets this
field to a non-zero value, "per_bio_data_size" bytes of auxiliary data
are allocated for each bio submitted to the target. These data can be
used for any purpose by the target and help us improve performance by
removing some per-target mempools.
Per-bio data is accessed with dm_per_bio_data. The
argument data_size must be the same as the value per_bio_data_size in
dm_target.
If the target has a pointer to per_bio_data, it can get a pointer to
the bio with dm_bio_from_per_bio_data() function (data_size must be the
same as the value passed to dm_per_bio_data).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.
WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.
The dm thin target uses this.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm linear: add WRITE SAME support
The linear target can already support WRITE SAME requests so signal
this by setting num_write_same_requests to 1.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:37 +0000 (20:23 +0000)]
dm: add WRITE SAME support
WRITE SAME bios have a payload that contain a single page. When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental). DM need only alter the
start and end of the WRITE SAME bio accordingly.
Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm: prepare to support WRITE SAME
Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.
A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm ioctl: use kmalloc if possible
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().
vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.
In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.
This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm ioctl: remove PF_MEMALLOC
When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:36 +0000 (20:23 +0000)]
dm persistent data: improve improve space map block alloc failure message
Improve space map error message when unable to allocate a new
metadata block.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm thin: use DMERR_LIMIT for errors
Throttle all errors logged from the IO path by dm thin.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm persistent data: use DMERR_LIMIT for errors
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:34 +0000 (20:23 +0000)]
dm block manager: reinstate message when validator fails
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jonathan Brassow [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm raid: round region_size to power of two
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically. If this is not a power of 2,
the md code will report an error later.
This patch catches the problem early and rounds the region_size to the
next power of two.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm thin: cleanup dead code
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm thin: rename cell_defer_except to cell_defer_no_holder
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:33 +0000 (20:23 +0000)]
dm snapshot: optimize track_chunk
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm raid: use DM_ENDIO_INCOMPLETE
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm raid1: remove impossible mempool_alloc error test
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
that tests if read_record is non-NULL is always true.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm thin: emit ignore_discard in status when discards disabled
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device. The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:32 +0000 (20:23 +0000)]
dm persistent data: fix nested btree deletion
When deleting nested btrees, the code forgets to delete the innermost
btree. The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.
This patch corrects both problems.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: wake worker when discard is prepared
When discards are prepared it is best to directly wake the worker that
will process them. The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.
Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding. DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.
The race manifests as follows:
1. A non-discard bio is mapped in thin_bio_map.
This doesn't lock out parallel activity to the same block.
2. A discard bio is issued to the same block as the non-discard bio.
3. The discard bio is locked in a dm_bio_prison_cell in process_discard
to lock out parallel activity against the same block.
4. The non-discard bio's mapping continues and its all_io_entry is
incremented so the bio is accounted for in the thin pool's all_io_ds
which is a dm_deferred_set used to track time locality of non-discard IO.
5. The non-discard bio is finally locked in a dm_bio_prison_cell in
process_bio.
The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:
INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby D
ffffffff8160f0e0 0 15354 15314 0x00000000
ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
[<
ffffffff814e5a19>] schedule+0x29/0x70
[<
ffffffff814e3d85>] schedule_timeout+0x195/0x220
[<
ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
[<
ffffffff814e589e>] wait_for_common+0x11e/0x190
[<
ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
[<
ffffffff814e59ed>] wait_for_completion+0x1d/0x20
[<
ffffffff81233289>] blkdev_issue_discard+0x219/0x260
[<
ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
[<
ffffffff8119a65c>] block_ioctl+0x3c/0x40
[<
ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
[<
ffffffff8119a547>] ? block_llseek+0x67/0xb0
[<
ffffffff811756f1>] sys_ioctl+0xa1/0xb0
[<
ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
[<
ffffffff814ef099>] system_call_fastpath+0x16/0x1b
The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.
The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks. That cell locking wasn't
occurring early enough in thin_bio_map. This patch fixes this.
Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.
Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so. Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.
This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
Joe Thornber [Fri, 21 Dec 2012 20:23:31 +0000 (20:23 +0000)]
dm thin: replace dm_cell_release_singleton with cell_defer_except
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.
Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.
If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton. Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.
Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.
This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 21 Dec 2012 20:23:30 +0000 (20:23 +0000)]
dm: disable WRITE SAME
WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.
As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Alasdair G Kergon [Fri, 21 Dec 2012 20:23:30 +0000 (20:23 +0000)]
dm ioctl: prevent unsafe change to dm_ioctl data_size
Abort dm ioctl processing if userspace changes the data_size parameter
after we validated it but before we finished copying the data buffer
from userspace.
The dm ioctl parameters are processed in the following sequence:
1. ctl_ioctl() calls copy_params();
2. copy_params() makes a first copy of the fixed-sized portion of the
userspace parameters into the local variable "tmp";
3. copy_params() then validates tmp.data_size and allocates a new
structure big enough to hold the complete data and copies the whole
userspace buffer there;
4. ctl_ioctl() reads userspace data the second time and copies the whole
buffer into the pointer "param";
5. ctl_ioctl() reads param->data_size without any validation and stores it
in the variable "input_param_size";
6. "input_param_size" is further used as the authoritative size of the
kernel buffer.
The problem is that userspace code could change the contents of user
memory between steps 2 and 4. In particular, the data_size parameter
can be changed to an invalid value after the kernel has validated it.
This lets userspace force the kernel to access invalid kernel memory.
The fix is to ensure that the size has not changed at step 4.
This patch shouldn't have a security impact because CAP_SYS_ADMIN is
required to run this code, but it should be fixed anyway.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
Mikulas Patocka [Fri, 21 Dec 2012 20:23:30 +0000 (20:23 +0000)]
dm persistent data: rename node to btree_node
This patch fixes a compilation failure on sparc32 by renaming struct node.
struct node is already defined in include/linux/node.h. On sparc32, it
happens to be included through other dependencies and persistent-data
doesn't compile because of conflicting declarations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Trond Myklebust [Fri, 21 Dec 2012 16:02:32 +0000 (11:02 -0500)]
NFS: Kill fscache warnings when mounting without -ofsc
The fscache code will currently bleat a "non-unique superblock keys"
warning even if the user is mounting without the 'fsc' option.
There should be no reason to even initialise the superblock cache cookie
unless we're planning on using fscache for something, so ensure that we
check for the NFS_OPTION_FSCACHE flag before calling into the fscache
code.
Reported-by: Paweł Sikora <pawel.sikora@agmk.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 21 Dec 2012 12:15:05 +0000 (12:15 +0000)]
NFS: Provide stub nfs_fscache_wait_on_invalidate() for when CONFIG_NFS_FSCACHE=n
Provide a stub nfs_fscache_wait_on_invalidate() function for when
CONFIG_NFS_FSCACHE=n lest the following error appear:
fs/nfs/inode.c: In function 'nfs_invalidate_mapping':
fs/nfs/inode.c:887:2: error: implicit declaration of function 'nfs_fscache_wait_on_invalidate' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 21 Dec 2012 05:30:12 +0000 (21:30 -0800)]
Merge tag 'vfio-for-v3.8-v2' of git://github.com/awilliam/linux-vfio
Pull vfio update from Alex Williamson.
* tag 'vfio-for-v3.8-v2' of git://github.com/awilliam/linux-vfio:
vfio-pci: Enable device before attempting reset
VFIO: fix out of order labels for error recovery in vfio_pci_init()
VFIO: use ACCESS_ONCE() to guard access to dev->driver
VFIO: unregister IOMMU notifier on error recovery path
vfio-pci: Re-order device reset
vfio: simplify kmalloc+copy_from_user to memdup_user
Linus Torvalds [Fri, 21 Dec 2012 04:11:52 +0000 (20:11 -0800)]
Merge branch 'for-next' of git://git.infradead.org/users/eparis/notify
Pull filesystem notification updates from Eric Paris:
"This pull mostly is about locking changes in the fsnotify system. By
switching the group lock from a spin_lock() to a mutex() we can now
hold the lock across things like iput(). This fixes a problem
involving unmounting a fs and having inodes be busy, first pointed out
by FAT, but reproducible with tmpfs.
This also restores signal driven I/O for inotify, which has been
broken since about 2.6.32."
Ugh. I *hate* the timing of this. It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes. That's just crap. But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.
Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.
* 'for-next' of git://git.infradead.org/users/eparis/notify:
inotify: automatically restart syscalls
inotify: dont skip removal of watch descriptor if creation of ignored event failed
fanotify: dont merge permission events
fsnotify: make fasync generic for both inotify and fanotify
fsnotify: change locking order
fsnotify: dont put marks on temporary list when clearing marks by group
fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
fsnotify: pass group to fsnotify_destroy_mark()
fsnotify: use a mutex instead of a spinlock to protect a groups mark list
fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
fsnotify: take groups mark_lock before mark lock
fsnotify: use reference counting for groups
fsnotify: introduce fsnotify_get_group()
inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
Linus Torvalds [Fri, 21 Dec 2012 04:00:43 +0000 (20:00 -0800)]
Merge branch 'akpm' (Andrew's patch-bomb)
Merge the rest of Andrew's patches for -rc1:
"A bunch of fixes and misc missed-out-on things.
That'll do for -rc1. I still have a batch of IPC patches which still
have a possible bug report which I'm chasing down."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
keys: use keyring_alloc() to create module signing keyring
keys: fix unreachable code
sendfile: allows bypassing of notifier events
SGI-XP: handle non-fatal traps
fat: fix incorrect function comment
Documentation: ABI: remove testing/sysfs-devices-node
proc: fix inconsistent lock state
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
memcg: don't register hotcpu notifier from ->css_alloc()
checkpatch: warn on uapi #includes that #include <uapi/...
revert "rtc: recycle id when unloading a rtc driver"
mm: clean up transparent hugepage sysfs error messages
hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
hfsplus: rework processing of hfs_btree_write() returned error
hfsplus: rework processing errors in hfsplus_free_extents()
hfsplus: avoid crash on failed block map free
kcmp: include linux/ptrace.h
drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
mm: cma: WARN if freed memory is still in use
exec: do not leave bprm->interp on stack
...
Linus Torvalds [Fri, 21 Dec 2012 02:14:31 +0000 (18:14 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull VFS update from Al Viro:
"fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
misc stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
vfs: make lremovexattr retry once on ESTALE error
vfs: make removexattr retry once on ESTALE
vfs: make llistxattr retry once on ESTALE error
vfs: make listxattr retry once on ESTALE error
vfs: make lgetxattr retry once on ESTALE
vfs: make getxattr retry once on an ESTALE error
vfs: allow lsetxattr() to retry once on ESTALE errors
vfs: allow setxattr to retry once on ESTALE errors
vfs: allow utimensat() calls to retry once on an ESTALE error
vfs: fix user_statfs to retry once on ESTALE errors
vfs: make fchownat retry once on ESTALE errors
vfs: make fchmodat retry once on ESTALE errors
vfs: have chroot retry once on ESTALE error
vfs: have chdir retry lookup and call once on ESTALE error
vfs: have faccessat retry once on an ESTALE error
vfs: have do_sys_truncate retry once on an ESTALE error
vfs: fix renameat to retry on ESTALE errors
vfs: make do_unlinkat retry once on ESTALE errors
vfs: make do_rmdir retry once on ESTALE errors
vfs: add a flags argument to user_path_parent
...
Linus Torvalds [Fri, 21 Dec 2012 02:05:28 +0000 (18:05 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
"sigaltstack infrastructure + conversion for x86, alpha and um,
COMPAT_SYSCALL_DEFINE infrastructure.
Note that there are several conflicts between "unify
SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
resolution is trivial - just remove definitions of SS_ONSTACK and
SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
include/uapi/linux/signal.h contains the unified variant."
Fixed up conflicts as per Al.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to generic sigaltstack
new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
generic compat_sys_sigaltstack()
introduce generic sys_sigaltstack(), switch x86 and um to it
new helper: compat_user_stack_pointer()
new helper: restore_altstack()
unify SS_ONSTACK/SS_DISABLE definitions
new helper: current_user_stack_pointer()
missing user_stack_pointer() instances
Bury the conditionals from kernel_thread/kernel_execve series
COMPAT_SYSCALL_DEFINE: infrastructure
Linus Torvalds [Fri, 21 Dec 2012 01:56:23 +0000 (17:56 -0800)]
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
"A number of smallish fixes scattered around the ARM code. Probably
the most serious one is the one from Al addressing the missing locking
in the swap emulation code."
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7607/1: realview: fix private peripheral memory base for EB rev. B boards
ARM: 7606/1: cache: flush to LoUU instead of LoUIS on uniprocessor CPUs
ARM: missing ->mmap_sem around find_vma() in swp_emulate.c
ARM: 7605/1: vmlinux.lds: Move .notes section next to the rodata
ARM: 7602/1: Pass real "__machine_arch_type" variable to setup_machine_tags() procedure
ARM: 7600/1: include CONFIG_DEBUG_LL_INCLUDE rather than mach/debug-macro.S
Linus Torvalds [Fri, 21 Dec 2012 01:55:34 +0000 (17:55 -0800)]
Merge tag 'fixes2' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes part 2 from Olof Johansson:
"Here are a few more fixes for 3.8. Two branches of fixes for Samsung
platforms, including fixes for the audio build errors on all non-DT
platforms. There's also a fixup to the sunxi device-tree file renames
due to a bad patch application by me, and a fix for OMAP due to
function renames merged through the powerpc tree."
* tag 'fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: OMAP2+: Fix compillation error in mach-omap2/timer.c
ARM: sunxi: rename device tree source files
ARM: EXYNOS: Avoid passing the clks through platform data
ARM: S5PV210: Avoid passing the clks through platform data
ARM: S5P64X0: Add I2S clkdev support
ARM: S5PC100: Add I2S clkdev support
ARM: S3C64XX: Add I2S clkdev support
ARM: EXYNOS: Fix MSHC clocks instance names
ARM: EXYNOS: Fix NULL pointer dereference bug in SMDKV310
ARM: EXYNOS: Fix NULL pointer dereference bug in SMDK4X12
ARM: EXYNOS: Fix NULL pointer dereference bug in Origen
ARM: SAMSUNG: Add missing include guard to gpio-core.h
pinctrl: exynos5440/samsung: Staticize pcfgs
pinctrl: samsung: Fix a typo in pinctrl-samsung.h
ARM: EXYNOS: fix skip scu_enable() for EXYNOS5440
ARM: EXYNOS: fix GIC using for EXYNOS5440
ARM: EXYNOS: fix build error when MFC is not selected
Linus Torvalds [Fri, 21 Dec 2012 01:52:06 +0000 (17:52 -0800)]
Merge branch 'misc' of git://git./linux/kernel/git/mmarek/kbuild
Pull kbuild misc changes from Michal Marek:
"This is the non-critical part of kbuild
- scripts/kernel-doc requires a "Return:" section for non-void
functions
- ARCH=arm SUBARCH=... support for make tags
- COMPILED_SOURCE=1 support for make tags (only indexes .c files for
which a .o exists)
- New coccinelle check
- Option parsing fix for scripts/config"
* 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
scripts/config: Fix wrong "shift" for --keep-case
scripts/tags.sh: Support compiled source
scripts/tags.sh: Support subarch for ARM
scripts/coccinelle/misc/warn.cocci: use WARN
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
David Howells [Thu, 20 Dec 2012 23:05:56 +0000 (15:05 -0800)]
keys: use keyring_alloc() to create module signing keyring
Use keyring_alloc() to create special keyrings now that it has
a permissions parameter rather than using key_alloc() +
key_instantiate_and_link().
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Thu, 20 Dec 2012 23:05:54 +0000 (15:05 -0800)]
keys: fix unreachable code
We set ret to NULL then test it. Remove the bogus test
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scott Wolchok [Thu, 20 Dec 2012 23:05:52 +0000 (15:05 -0800)]
sendfile: allows bypassing of notifier events
do_sendfile() in fs/read_write.c does not call the fsnotify functions,
unlike its neighbors. This manifests as a lack of inotify ACCESS events
when a file is sent using sendfile(2).
Addresses
https://bugzilla.kernel.org/show_bug.cgi?id=12812
[akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Scott Wolchok <swolchok@umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robin Holt [Thu, 20 Dec 2012 23:05:50 +0000 (15:05 -0800)]
SGI-XP: handle non-fatal traps
We found a user code which was raising a divide-by-zero trap. That trap
would lead to XPC connections between system-partitions being torn down
due to the die_chain notifier callouts it received.
This also revealed a different issue where multiple callers into
xpc_die_deactivate() would all attempt to do the disconnect in parallel
which would sometimes lock up but often overwhelm the console on very
large machines as each would print at least one line of output at the
end of the deactivate.
I reviewed all the users of the die_chain notifier and changed the code
to ignore the notifier callouts for reasons which will not actually lead
to a system to continue on to call die().
[akpm@linux-foundation.org: fix ia64]
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ravishankar N [Thu, 20 Dec 2012 23:05:46 +0000 (15:05 -0800)]
fat: fix incorrect function comment
fat_search_long() returns 0 on success, -ENOENT/ENOMEM on failure.
Change the function comment accordingly.
While at it, fix some trivial typos.
Signed-off-by: Ravishankar N <cyberax82@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Thu, 20 Dec 2012 23:05:45 +0000 (15:05 -0800)]
Documentation: ABI: remove testing/sysfs-devices-node
This file is already documented in the stable ABI (see commit
5bbe1ec11fcf).
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Greg KH <greg@kroah.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaotian Feng [Thu, 20 Dec 2012 23:05:44 +0000 (15:05 -0800)]
proc: fix inconsistent lock state
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq. Currently, kernel is using spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.
Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.
=================================
[ INFO: inconsistent lock state ]
3.7.0 #36 Not tainted
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
{SOFTIRQ-ON-W} state was registered at:
__lock_acquire+0x8ae/0xca0
lock_acquire+0x199/0x200
_raw_spin_lock+0x41/0x50
proc_alloc_inum+0x4c/0xd0
alloc_mnt_ns+0x49/0xc0
create_mnt_ns+0x25/0x70
mnt_init+0x161/0x1c7
vfs_caches_init+0x107/0x11a
start_kernel+0x348/0x38c
x86_64_start_reservations+0x131/0x136
x86_64_start_kernel+0x103/0x112
irq event stamp: 2993422
hardirqs last enabled at (2993422): _raw_spin_unlock_irqrestore+0x55/0x80
hardirqs last disabled at (2993421): _raw_spin_lock_irqsave+0x29/0x70
softirqs last enabled at (2993394): _local_bh_enable+0x13/0x20
softirqs last disabled at (2993395): call_softirq+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(proc_inum_lock);
<Interrupt>
lock(proc_inum_lock);
*** DEADLOCK ***
no locks held by swapper/1/0.
stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
Call Trace:
<IRQ> [<
ffffffff810a40f1>] ? vprintk_emit+0x471/0x510
print_usage_bug+0x2a5/0x2c0
mark_lock+0x33b/0x5e0
__lock_acquire+0x813/0xca0
lock_acquire+0x199/0x200
_raw_spin_lock+0x41/0x50
proc_free_inum+0x1c/0x50
free_pid_ns+0x1c/0x50
put_pid_ns+0x2e/0x50
put_pid+0x4a/0x60
delayed_put_pid+0x12/0x20
rcu_process_callbacks+0x462/0x790
__do_softirq+0x1b4/0x3b0
call_softirq+0x1c/0x30
do_softirq+0x59/0xd0
irq_exit+0x54/0xd0
smp_apic_timer_interrupt+0x95/0xa3
apic_timer_interrupt+0x72/0x80
cpuidle_enter_tk+0x10/0x20
cpuidle_enter_state+0x17/0x50
cpuidle_idle_call+0x287/0x520
cpu_idle+0xba/0x130
start_secondary+0x2b3/0x2bc
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guenter Roeck [Thu, 20 Dec 2012 23:05:42 +0000 (15:05 -0800)]
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
Commit
263a523d18bc ("linux/kernel.h: Fix warning seen with W=1 due to
change in DIV_ROUND_CLOSEST") fixes a warning seen with W=1 due to
change in DIV_ROUND_CLOSEST.
Unfortunately, the C compiler converts divide operations with unsigned
divisors to unsigned, even if the dividend is signed and negative (for
example, -10 / 5U =
858993457). The C standard says "If one operand has
unsigned int type, the other operand is converted to unsigned int", so
the compiler is not to blame. As a result, DIV_ROUND_CLOSEST(0, 2U) and
similar operations now return bad values, since the automatic conversion
of expressions such as "0 - 2U/2" to unsigned was not taken into
account.
Fix by checking for the divisor variable type when deciding which
operation to perform. This fixes DIV_ROUND_CLOSEST(0, 2U), but still
returns bad values for negative dividends divided by unsigned divisors.
Mark the latter case as unsupported.
One observed effect of this problem is that the s2c_hwmon driver reports
a value of 4198403 instead of 0 if the ADC reads 0.
Other impact is unpredictable. Problem is seen if the divisor is an
unsigned variable or constant and the dividend is less than (divisor/2).
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Juergen Beisert <jbe@pengutronix.de>
Tested-by: Juergen Beisert <jbe@pengutronix.de>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: <stable@vger.kernel.org> [3.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Thu, 20 Dec 2012 23:05:40 +0000 (15:05 -0800)]
memcg: don't register hotcpu notifier from ->css_alloc()
Commit
648bb56d076b ("cgroup: lock cgroup_mutex in cgroup_init_subsys()")
made cgroup_init_subsys() grab cgroup_mutex before invoking
->css_alloc() for the root css. Because memcg registers hotcpu notifier
from ->css_alloc() for the root css, this introduced circular locking
dependency between cgroup_mutex and cpu hotplug.
Fix it by moving hotcpu notifier registration to a subsys initcall.
======================================================
[ INFO: possible circular locking dependency detected ]
3.7.0-rc4-work+ #42 Not tainted
-------------------------------------------------------
bash/645 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<
ffffffff8110c5b7>] cgroup_lock+0x17/0x20
but task is already holding lock:
(cpu_hotplug.lock){+.+.+.}, at: [<
ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug.lock){+.+.+.}:
lock_acquire+0x97/0x1e0
mutex_lock_nested+0x61/0x3b0
get_online_cpus+0x3c/0x60
rebuild_sched_domains_locked+0x1b/0x70
cpuset_write_resmask+0x298/0x2c0
cgroup_file_write+0x1ef/0x300
vfs_write+0xa8/0x160
sys_write+0x52/0xa0
system_call_fastpath+0x16/0x1b
-> #0 (cgroup_mutex){+.+.+.}:
__lock_acquire+0x14ce/0x1d20
lock_acquire+0x97/0x1e0
mutex_lock_nested+0x61/0x3b0
cgroup_lock+0x17/0x20
cpuset_handle_hotplug+0x1b/0x560
cpuset_update_active_cpus+0xe/0x10
cpuset_cpu_inactive+0x47/0x50
notifier_call_chain+0x66/0x150
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x20/0x40
_cpu_down+0x7e/0x2f0
cpu_down+0x36/0x50
store_online+0x5d/0xe0
dev_attr_store+0x18/0x30
sysfs_write_file+0xe0/0x150
vfs_write+0xa8/0x160
sys_write+0x52/0xa0
system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(cgroup_mutex);
lock(cpu_hotplug.lock);
lock(cgroup_mutex);
*** DEADLOCK ***
5 locks held by bash/645:
#0: (&buffer->mutex){+.+.+.}, at: [<
ffffffff8123bab8>] sysfs_write_file+0x48/0x150
#1: (s_active#42){.+.+.+}, at: [<
ffffffff8123bb38>] sysfs_write_file+0xc8/0x150
#2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<
ffffffff81079277>] cpu_hotplug_driver_lock+0x1
+7/0x20
#3: (cpu_add_remove_lock){+.+.+.}, at: [<
ffffffff81093157>] cpu_maps_update_begin+0x17/0x20
#4: (cpu_hotplug.lock){+.+.+.}, at: [<
ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60
stack backtrace:
Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42
Call Trace:
print_circular_bug+0x28e/0x29f
__lock_acquire+0x14ce/0x1d20
lock_acquire+0x97/0x1e0
mutex_lock_nested+0x61/0x3b0
cgroup_lock+0x17/0x20
cpuset_handle_hotplug+0x1b/0x560
cpuset_update_active_cpus+0xe/0x10
cpuset_cpu_inactive+0x47/0x50
notifier_call_chain+0x66/0x150
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x20/0x40
_cpu_down+0x7e/0x2f0
cpu_down+0x36/0x50
store_online+0x5d/0xe0
dev_attr_store+0x18/0x30
sysfs_write_file+0xe0/0x150
vfs_write+0xa8/0x160
sys_write+0x52/0xa0
system_call_fastpath+0x16/0x1b
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 20 Dec 2012 23:05:37 +0000 (15:05 -0800)]
checkpatch: warn on uapi #includes that #include <uapi/...
Avoid specifying internal uapi #include paths with uapi/... as
userspace should not use and never see that.
Neaten message line wrapping above.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Thu, 20 Dec 2012 23:05:34 +0000 (15:05 -0800)]
revert "rtc: recycle id when unloading a rtc driver"
Revert commit
2830a6d20139df2198d63235df7957712adb28e5.
We already perform the ida_simple_remove() in rtc_device_release(),
which is an appropriate place. Commit
2830a6d20 ("rtc: recycle id when
unloading a rtc driver") caused the kernel to emit
ida_remove called for id=0 which is not allocated.
warnings when rtc_device_release() tries to release an alread-released
ID.
Let's restore things to their previous state and then work out why
Vincent's kernel wasn't calling rtc_device_release() - presumably a bug
in a specific sub-driver.
Reported-by: Lothar Waßmann <LW@KARO-electronics.de>
Acked-by: Alexander Holler <holler@ahsoftware.de>
Cc: Vincent Palatin <vpalatin@chromium.org>
Cc: <stable@vger.kernel.org> [3.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeremy Eder [Thu, 20 Dec 2012 23:05:32 +0000 (15:05 -0800)]
mm: clean up transparent hugepage sysfs error messages
Clarify error messages and correct a few typos in the transparent hugepage
sysfs init code.
Signed-off-by: Jeremy Eder <jeder@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vyacheslav Dubeyko [Thu, 20 Dec 2012 23:05:29 +0000 (15:05 -0800)]
hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
Add an error message for the case of failure of sync fs in
delayed_sync_fs() method.
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vyacheslav Dubeyko [Thu, 20 Dec 2012 23:05:28 +0000 (15:05 -0800)]
hfsplus: rework processing of hfs_btree_write() returned error
Add to hfs_btree_write() a return of -EIO on failure of b-tree node
searching. Also add logic ofor processing errors from hfs_btree_write()
in hfsplus_system_write_inode() with a message about b-tree writing
failure.
[akpm@linux-foundation.org: reduce scope of `err', print errno on error]
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>