Kent Overstreet [Wed, 24 Jul 2013 23:46:42 +0000 (16:46 -0700)]
bcache: Don't bother with bucket refcount for btree node allocations
The bucket refcount (dropped with bkey_put()) is only needed to prevent
the newly allocated bucket from being garbage collected until we've
added a pointer to it somewhere. But for btree node allocations, the
fact that we have btree nodes locked is enough to guard against races
with garbage collection.
Eventually the per bucket refcount is going to be replaced with
something specific to bch_alloc_sectors().
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 24 Oct 2013 23:36:03 +0000 (16:36 -0700)]
bcache: Debug code improvements
Couple changes:
* Consolidate bch_check_keys() and bch_check_key_order(), and move the
checks that only check_key_order() could do to bch_btree_iter_next().
* Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to
flip on the EDEBUG checks at runtime.
* Dropped an old not terribly useful check in rw_unlock(), and
refactored/improved a some of the other debug code.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 01:14:44 +0000 (18:14 -0700)]
bcache: Fix bch_ptr_bad()
Previously, bch_ptr_bad() could return false when there was a pointer to
a nonexistant device... it only filtered out keys with PTR_CHECK_DEV
pointers.
This behaviour was intended for multiple cache device support; for that,
just because the device for one of the pointers has gone away doesn't
mean we want to filter out the rest of the pointers.
But we don't yet explicitly filter/check individual pointers, so without
that this behaviour was wrong - a corrupt bkey with a bad device pointer
could cause us to deref a bad pointer. Doh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 31 Oct 2013 22:46:42 +0000 (15:46 -0700)]
bcache: Pull on disk data structures out into a separate header
Now, the on disk data structures are in a header that can be exported to
userspace - and having them all centralized is nice too.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 01:11:11 +0000 (18:11 -0700)]
bcache: Move sector allocator to alloc.c
Just reorganizing things a bit.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 02:02:45 +0000 (19:02 -0700)]
bcache: Break up struct search
With all the recent refactoring around struct btree op struct search has
gotten rather large.
But we can now easily break it up in a different way - we break out
struct btree_insert_op which is for inserting data into the cache, and
that's now what the copying gc code uses - struct search is now specific
to request.c
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 01:07:22 +0000 (18:07 -0700)]
bcache: Convert bch_btree_insert() to bch_btree_map_leaf_nodes()
Last of the btree_map() conversions. Main visible effect is
bch_btree_insert() is no longer taking a struct btree_op as an argument
anymore - there's no fancy state machine stuff going on, it's just a
normal function.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 01:06:22 +0000 (18:06 -0700)]
bcache: Don't use op->insert_collision
When we convert bch_btree_insert() to bch_btree_map_leaf_nodes(), we
won't be passing struct btree_op to bch_btree_insert() anymore - so we
need a different way of returning whether there was a collision (really,
a replace collision).
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 01:52:54 +0000 (18:52 -0700)]
bcache: Kill op->replace
This is prep work for converting bch_btree_insert to
bch_btree_map_leaf_nodes() - we have to convert all its arguments to
actual arguments. Bunch of churn, but should be straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Sat, 2 Nov 2013 01:03:08 +0000 (18:03 -0700)]
bcache: Drop some closure stuff
With a the recent bcache refactoring, some of the closure code isn't
needed anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 01:04:18 +0000 (18:04 -0700)]
bcache: Kill op->cl
This isn't used for waiting asynchronously anymore - so this is a fairly
trivial refactoring.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:44:17 +0000 (17:44 -0700)]
bcache: Prune struct btree_op
Eventual goal is for struct btree_op to contain only what is necessary
for traversing the btree.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:41:13 +0000 (17:41 -0700)]
bcache: Clean up cache_lookup_fn
There was some looping in submit_partial_cache_hit() and
submit_partial_cache_hit() that isn't needed anymore - originally, we
wouldn't necessarily process the full hit or miss all at once because
when splitting the bio, we took into account the restrictions of the
device we were sending it to.
But, device bio size restrictions are now handled elsewhere, with a
wrapper around generic_make_request() - so that looping has been
unnecessary for awhile now and we can now do quite a bit of cleanup.
And if we trim the key we're reading from to match the subset we're
actually reading, we don't have to explicitly calculate bi_sector
anymore. Neat.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:41:08 +0000 (17:41 -0700)]
bcache: Convert bch_btree_read_async() to bch_btree_map_keys()
This is a fairly straightforward conversion, mostly reshuffling -
op->lookup_done goes away, replaced by MAP_DONE/MAP_CONTINUE. And the
code for handling cache hits and misses wasn't really btree code, so it
gets moved to request.c.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:37:59 +0000 (17:37 -0700)]
bcache: Move some stuff to btree.c
With the new btree_map() functions, we don't need to export the stuff
needed for traversing the btree anymore.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 01:48:51 +0000 (18:48 -0700)]
bcache: Add btree_map() functions
Lots of stuff has been open coding its own btree traversal - which is
generally pretty simple code, but there are a few subtleties.
This adds new new functions, bch_btree_map_nodes() and
bch_btree_map_keys(), which do the traversal for you. Everything that's
open coding btree traversal now (with the exception of garbage
collection) is slowly going to be converted to these two functions;
being able to write other code at a higher level of abstraction is a
big improvement w.r.t. overall code quality.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:50:06 +0000 (17:50 -0700)]
bcache: Convert writeback to a kthread
This simplifies the writeback flow control quite a bit - previously, it
was conceptually two coroutines, refill_dirty() and read_dirty(). This
makes the code quite a bit more straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Fri, 25 Oct 2013 00:19:26 +0000 (17:19 -0700)]
bcache: Convert gc to a kthread
We needed a dedicated rescuer workqueue for gc anyways... and gc was
conceptually a dedicated thread, just one that wasn't running all the
time. Switch it to a dedicated thread to make the code a bit more
straightforward.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:29:09 +0000 (17:29 -0700)]
bcache: Convert bucket_wait to wait_queue_head_t
At one point we did do fancy asynchronous waiting stuff with
bucket_wait, but that's all gone (and bucket_wait is used a lot less
than it used to be). So use the standard primitives.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:27:07 +0000 (17:27 -0700)]
bcache: Convert try_wait to wait_queue_head_t
We never waited on c->try_wait asynchronously, so just use the standard
primitives.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:26:51 +0000 (17:26 -0700)]
bcache: Move keylist out of btree_op
Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Fri, 25 Oct 2013 00:07:04 +0000 (17:07 -0700)]
bcache: Refactor journalling flow control
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 00:06:17 +0000 (17:06 -0700)]
bcache: Refactor read request code a bit
More refactoring, and renaming.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:24:52 +0000 (17:24 -0700)]
bcache: Refactor request_write()
Try to improve some of the naming a bit to be more consistent, and also
improve the flow of control in request_write() a bit.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:24:25 +0000 (17:24 -0700)]
bcache: Clean up keylist code
More random refactoring.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 01:46:36 +0000 (18:46 -0700)]
bcache: Add explicit keylist arg to btree_insert()
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 01:39:16 +0000 (18:39 -0700)]
bcache: Convert btree_insert_check_key() to btree_insert_node()
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:22:44 +0000 (17:22 -0700)]
bcache: Insert multiple keys at a time
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.
Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.
With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Wed, 11 Sep 2013 01:41:15 +0000 (18:41 -0700)]
bcache: Add btree_insert_node()
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.
The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.
This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.
This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:20:19 +0000 (17:20 -0700)]
bcache: Explicitly track btree node's parent
This is prep work for the reworked btree insertion code.
The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.
I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:20:00 +0000 (17:20 -0700)]
bcache: Remove unnecessary check in should_split()
Checking i->seq was redundant, because since ages ago we always
initialize the new bset when advancing b->written
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Sat, 17 Aug 2013 09:13:15 +0000 (02:13 -0700)]
bcache: Stripe size isn't necessarily a power of two
Originally I got this right... except that the divides didn't use
do_div(), which broke 32 bit kernels. When I went to fix that, I forgot
that the raid stripe size usually isn't a power of two... doh
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Fri, 12 Jul 2013 02:42:51 +0000 (19:42 -0700)]
bcache: Add on error panic/unregister setting
Works kind of like the ext4 setting, to panic or remount read only on
errors.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Thu, 25 Jul 2013 00:16:09 +0000 (17:16 -0700)]
bcache: Use blkdev_issue_discard()
The old asynchronous discard code was really a relic from when all the
allocation code was asynchronous - now that allocation runs out of a
dedicated thread there's no point in keeping around all that complicated
machinery.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Fri, 25 Oct 2013 00:12:52 +0000 (17:12 -0700)]
bcache: Fix a lockdep splat
bch_keybuf_del() takes a spinlock that can't be taken in interrupt context -
whoops. Fortunately, this code isn't enabled by default (you have to toggle a
sysfs thing).
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Kent Overstreet [Tue, 8 Oct 2013 22:50:46 +0000 (15:50 -0700)]
bcache: Fix a journalling performance bug
Kent Overstreet [Mon, 11 Nov 2013 05:55:27 +0000 (21:55 -0800)]
bcache: Fix dirty_data accounting
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Ben Harris [Fri, 18 Oct 2013 20:23:35 +0000 (21:23 +0100)]
floppy: Correct documentation of driver options when used as a module.
The options have to be passed space-separated and prefixed by "floppy=",
rather than separately and unprefixed.
This fixes <http://bugs.debian.org/726655>.
Signed-off-by: Ben Harris <bjh21@cam.ac.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Dan Carpenter [Wed, 6 Nov 2013 08:24:02 +0000 (09:24 +0100)]
pktcdvd: debugfs functions return NULL on error
My static checker complains correctly that this is potential NULL
dereference because debugfs functions return NULL on error. They return
an ERR_PTR if they are configured out.
We don't need to check for ERR_PTR because if debugfs is stubbed out the
dummy functions won't complain about that. We don't need to check the
values before calling debugfs_remove() because that accepts ERR_PTRs and
NULL pointers.
We don't need to set pkt->dfs_f_info to NULL in pkt_debugfs_dev_new()
because it was initialized with kzalloc() so I have removed that.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Roger Pau Monne [Tue, 29 Oct 2013 17:31:14 +0000 (18:31 +0100)]
xen-blkfront: restore the non-persistent data path
When persistent grants were added they were always used, even if the
backend doesn't have this feature (there's no harm in always using the
same set of pages). This restores the old data path when the backend
doesn't have persistent grants, removing the burden of doing a memcpy
when it is not actually needed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Felipe Franciosi <felipe.franciosi@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Fix up whitespace issues]
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:09 +0000 (12:37 +0100)]
skd: fix formatting in skd_s1120.h
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:08 +0000 (12:37 +0100)]
skd: reorder construct/destruct code
Reorder placement of skd_construct(), skd_cons_sg_list(), skd_destruct()
and skd_free_sg_list() functions. Then remove no longer needed function
prototypes.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:07 +0000 (12:37 +0100)]
skd: cleanup skd_do_inq_page_da()
skdev->pdev and skdev->pdev->bus are always different than NULL in
skd_do_inq_page_da() so simplify the code accordingly.
Also cache skdev->pdev value in pdev variable while at it.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:06 +0000 (12:37 +0100)]
skd: remove SKD_OMIT_FROM_SRC_DIST ifdefs
SKD_OMIT_FROM_SRC_DIST is never defined.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:05 +0000 (12:37 +0100)]
skd: remove redundant skdev->pdev assignment from skd_pci_probe()
skdev->pdev is set to pdev twice in skd_pci_probe(), first time
through skd_construct() call and the second time directly in
the function. Remove the second assignment as it is not needed.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:04 +0000 (12:37 +0100)]
skd: use <asm/unaligned.h>
Use <asm/unaligned.h> instead of <asm-generic/unaligned.h>.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:03 +0000 (12:37 +0100)]
skd: remove SCSI subsystem specific includes
This is not a SCSI host driver so remove SCSI subsystem specific
includes.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:02 +0000 (12:37 +0100)]
skd: register block device only if some devices are present
Register block device in skd_pci_probe() instead of in skd_init() so it
is registered only if some devices are present (currently it is always
registered when the driver is loaded). Please note that this change
depends on the fact that register_blkdev(0, ...) never returns 0.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:01 +0000 (12:37 +0100)]
skd: fix error messages in skd_init()
* change priority level from KERN_INFO to KERN_ERR
* add "skd: " prefix
* do minor CodingStyle fixes
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:37:00 +0000 (12:37 +0100)]
skd: fix error paths in skd_init()
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bartlomiej Zolnierkiewicz [Tue, 5 Nov 2013 11:36:59 +0000 (12:36 +0100)]
skd: fix unregister_blkdev() placement
register_blkdev() is called before pci_register_driver() in skd_init()
so unregister_blkdev() should be called after pci_unregister_driver()
in skd_exit(). Fix it.
Cc: Akhil Bhansali <abhansali@stec-inc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Snitzer [Fri, 1 Nov 2013 19:05:10 +0000 (15:05 -0400)]
skd: more removal of bio-based code
Remove skd_flush_cmd structure and skd_flush_slab.
Remove skd_end_request wrapper around skd_end_request_blk.
Remove skd_requeue_request, use blk_requeue_request directly.
Cleanup some comments (remove "bio" info) and whitespace.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 1 Nov 2013 16:38:45 +0000 (10:38 -0600)]
skd: cleanup the skd_*() function block wrapping
Just call the block functions directly, don't wrap them
in skd helpers. With only one queueing model enabled, there's
no point in doing that.
Also kill the ->start_time and ->bio from the skd_request_context,
we don't use those anymore.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 1 Nov 2013 16:14:56 +0000 (10:14 -0600)]
skd: rip out bio path
The skd driver has a selectable rq or bio based queueing model.
For 3.14, we want to turn this into a single blk-mq interface
instead. With the immutable biovecs being merged in 3.13, the
bio model would need patches to even work. So rip it out, with
a conversion pending for blk-mq in the next release.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wei Yongjun [Wed, 30 Oct 2013 05:23:53 +0000 (13:23 +0800)]
skd: fix error return code in skd_pci_probe()
Fix to return -ENOMEM in the skd construct error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Heiko Carstens [Thu, 31 Oct 2013 12:24:28 +0000 (13:24 +0100)]
s390/dasd: hold request queue sysfs lock when calling elevator_init()
"elevator: Fix a race in elevator switching and md device initialization"
changed the semantics of elevator_init() in a way that now enforces to hold
the corresponding request queue's sysfs_lock when calling elevator_init()
to fix a race.
The patch did not convert the s390 dasd device driver which is the only
device driver which also calls elevator_init(). So add the missing locking.
Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stephen M. Cameron [Tue, 29 Oct 2013 19:46:06 +0000 (13:46 -0600)]
cciss: return 0 from driver probe function on success, not 1
A return value of 1 is interpreted as an error
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rchinthekindi [Thu, 24 Oct 2013 11:51:23 +0000 (12:51 +0100)]
skd: Replaced custom debug PRINTKs with pr_debug
Replaced DPRINTK() and VPRINTK() with pr_debug().
Signed-off-by: Ramprasad C <ramprasad.chinthekindi@hgst.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Akhil Bhansali [Wed, 23 Oct 2013 12:00:08 +0000 (13:00 +0100)]
skd: Fix checkpatch ERRORS and removed unused functions
This patch fixes checkpatch.pl errors for assignment in if condition.
It also removes unused readq / readl function calls.
As Andrew had disabled the compilation of drivers for 32 bit,
I have modified format specifiers in few VPRINTKs to avoid warnings
during 64 bit compilation.
Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Reviewed-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philip J Kelleher [Fri, 18 Oct 2013 22:12:35 +0000 (17:12 -0500)]
rsxx: Fix possible kernel panic with invalid config.
This patch fixes a possible Kernel Panic on driver load if
the configuration on the card is messed up or not yet set.
The driver could possible give a 32 bit unsigned all Fs to
the kernel as the device's block size.
Now we only write the block size to the kernel if the
configuration from the card is valid.
Also, driver version is being updated.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philip J Kelleher [Fri, 18 Oct 2013 22:11:46 +0000 (17:11 -0500)]
rsxx: Disallow discards from being unmapped.
This patch fixes a bug in which discards were always
calling pci_unmap_page. Discards should never call the
pci_unmap_page function call because they are never mapped.
This caused a race condition on PowerPC systems when issuing
discards, writes, and reads all at the same time. The
pci_map_page function would eventually map logical address
0 for a read or write. Discards are always assigned a DMA
address of 0 because they are never mapped. So if
pci_map_page mapped address 0 for a DMA and a discard was
"unmapped" then the address would be freed and would cause
an EEH event to occur when Hardware accesses the address.
This was injected/uncovered in commit:
b347f9cf0bc8d42ee95ba1d3837fd93045ab336b
The pci_dma_mapping_error function declares -1 a DMA_ERROR
not 0 like initially thought So before we would never unmap
discards because they were considered NULL.
This patch should fall on top of commit id:
fc1967bb08a6184ed44ef990e1dd4389901b809c
Also, the driver version is being up dated.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Wed, 23 Oct 2013 08:59:19 +0000 (10:59 +0200)]
drbd: avoid to shrink max_bio_size due to peer re-configuration
For a long time, the receiving side has spread "too large" incoming
requests over multiple bios. No need to shrink our max_bio_size
(max_hw_sectors) if the peer is reconfigured to use a different storage.
The problem manifests itself if we are not the top of the device stack
(DRBD is used a LVM PV).
A hardware reconfiguration on the peer may cause the supported
max_bio_size to shrink, and the connection handshake would now
unnecessarily shrink the max_bio_size on the active node.
There is no way to notify upper layers that they have to "re-stack"
their limits. So they won't notice at all, and may keep submitting bios
that are suddenly considered "too large for device".
We already check for compatibility and ignore changes on the peer,
the code only was masked out unless we have a fully established connection.
We just need to allow it a bit earlier during the handshake.
Also consider max_hw_sectors in our merge bvec function, just in case.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Wed, 23 Oct 2013 08:59:18 +0000 (10:59 +0200)]
drbd: fix decoding of bitmap vli rle for device sizes > 64 TB
Symptoms: disconnect after bitmap exchange due to
bitmap overflow (e:
49731075554) while decoding bm RLE packet
In the decoding step of the variable length integer run length encoding
there was potentially an uncatched bitshift by wordsize (variable >> 64).
The result of which is "undefined" :(
(only "sometimes" the result is the desired 0)
Fix: don't do any bit shift magic for shift == 64, just assign.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philipp Reisner [Wed, 23 Oct 2013 08:59:17 +0000 (10:59 +0200)]
drbd: Fix adding of new minors with freshly created meta data
Online adding of new minors with freshly created meta data
to an resource with an established connection failed, with a
wrong state transition on one side on one side of the new minor.
Freshly created meta-data has a la_size (last agreed size) of 0.
When we online add such devices, the code wrongly got into
the code path for resyncing new storage that was added while
the disk was detached.
Fixed that by making the GREW from ZERO a special case.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philipp Reisner [Wed, 23 Oct 2013 08:59:16 +0000 (10:59 +0200)]
drbd: Fix an connection drop issue after enabling allow-two-primaries
Since drbd-8.4.0 it is possible to change the allow-two-primaries
network option while the connection is established.
The sequence code used to partially order packets from the
data socket with packets from the meta-data socket, still assued
that the allow-two-primaries option is constant while the
connection is established.
I.e.
On a node that has the RESOLVE_CONFLICTS bits set, after enabling
allow-two-primaries, when receiving the next data packet it timed out
while waiting for the necessary packets on the data socket to arrive
(wait_for_and_update_peer_seq() function).
Fixed that by always tracking the sequence number, but only waiting
for it if allow-two-primaries is set.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Wed, 23 Oct 2013 08:59:15 +0000 (10:59 +0200)]
drbd: fix NULL pointer deref in module init error path
If we want to iterate over the (as of yet still empty) list in the
cleanup path, we need to initialize the list before the first goto fail.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 17 Oct 2013 22:38:30 +0000 (16:38 -0600)]
block: disable cpqarray in Kconfig
Mike writes:
"cpqarray hasn't been used in over 12 years. It's doubtful that anyone
still uses the board. It's time the driver was removed from the mainline
kernel. The only updates these days are minor and mostly done by people
outside of HP."
If nobody yells, we'll remove it from the kernel tree completely
for 3.15.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Akhil Bhansali [Tue, 15 Oct 2013 20:19:07 +0000 (14:19 -0600)]
Add support for sTec's pci-e flash card Kronos
Signed-off-by: Akhil Bhansali <abhansali@stec-inc.com>
Signed-off-by: Ramprasad Chinthekindi <rchinthekindi@stec-inc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Folded patch, contributions to clean up this driver from:
Jens Axboe
Dan Carpenter
Andrew Morton
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philip J Kelleher [Sat, 28 Sep 2013 02:42:50 +0000 (20:42 -0600)]
rsxx: Kernel Panic caused by mapping Discards
This fixes a kernel panic injected by commit id
8d26750143341831bc312f61c5ed141eeb75b8d0 where discards
are getting mapped through the pci_map_page function call.
The driver will now start verifying that a dma is not a
discard before issuing a the pci_map_page function call.
Also, we are updating the driver version.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Milburn [Thu, 23 May 2013 21:23:45 +0000 (16:23 -0500)]
mtip32xx: dynamically allocate buffer in debugfs functions
Dynamically allocate buf to prevent warnings:
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_device_status’:
drivers/block/mtip32xx/mtip32xx.c:2823: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_registers’:
drivers/block/mtip32xx/mtip32xx.c:2894: warning: the frame size of 1056 bytes is larger than 1024 bytes
drivers/block/mtip32xx/mtip32xx.c: In function ‘mtip_hw_read_flags’:
drivers/block/mtip32xx/mtip32xx.c:2917: warning: the frame size of 1056 bytes is larger than 1024 bytes
Signed-off-by: David Milburn <dmilburn@redhat.com>
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Asai Thambi S P [Wed, 11 Sep 2013 19:14:42 +0000 (13:14 -0600)]
mtip32xx: Add SRSI support
This patch add support for SRSI(Surprise Removal Surprise Insertion).
Approach:
---------
Surprise Removal:
-----------------
On surprise removal of the device, gendisk, request queue, device index, sysfs
entries, etc are retained as long as device is in use - mounted filesystem,
device opened by an application, etc. The service thread breaks out of the main
while loop, waits for pci remove to exit, and then waits for device to become
free. When there no holders of the device, service thread cleans up the block
and device related stuff and returns.
Surprise Insertion:
-------------------
No change, this scenario follows the normal pci probe() function flow.
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philip J Kelleher [Wed, 4 Sep 2013 18:59:35 +0000 (13:59 -0500)]
rsxx: Moving pci_map_page to prevent overflow.
The pci_map_page function has been moved into our
issued workqueue to prevent an us running out of
mappable addresses on non-HWWD PCIe x8 slots. The
maximum amount that can possible be mapped at one
time now is: 255 dmas X 4 dma channels X 4096 Bytes.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philip J Kelleher [Wed, 4 Sep 2013 18:59:02 +0000 (13:59 -0500)]
rsxx: Handling failed pci_map_page on PowerPC and double free.
The rsxx driver was not checking the correct value during a
pci_map_page failure. Fixing this also uncovered a
double free if the bio was returned before it was
broken up into indiviadual 4k dmas, that is also
fixed here.
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mikulas Patocka [Tue, 15 Oct 2013 20:14:38 +0000 (14:14 -0600)]
loop: fix crash when using unassigned loop device
When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7].
The devices have no request routine and thus, when they are used without
being assigned, a crash happens.
For example, these commands cause crash (assuming there are no used loop
devices):
Kernel Fault: Code=26 regs=
000000007f420980 (Addr=
0000000000000010)
CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1
Workqueue: ksnaphd do_metadata [dm_snapshot]
task:
000000007fcf4078 ti:
000000007f420000 task.ti:
000000007f420000
[ 116.319988]
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW:
00001000000001001111111100001111 Not tainted
r00-03
000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0
r04-07
00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930
r08-11
000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8
r12-15
0000000000000010 0000000000000000 0000000000000000 0000000000000000
r16-19
000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f
r20-23
0000000000000000 0000000000000008 0000000000000000 0000000000000000
r24-27
00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0
r28-31
0000000000000000 000000007f420950 000000007f420980 000000007f4208e8
sr00-03
0000000000000000 0000000000000000 0000000000000000 0000000000303000
sr04-07
0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 117.549988]
IASQ:
0000000000000000 0000000000000000 IAOQ:
00000000402d82fc 00000000402d8300
IIR:
53820020 ISR:
0000000000000000 IOR:
0000000000000010
CPU: 1 CR30:
000000007f420000 CR31:
ffffffffffffffff
ORIG_R28:
0000000000000001
IAOQ[0]: generic_make_request+0x11c/0x1a0
IAOQ[1]: generic_make_request+0x120/0x1a0
RP(r2): generic_make_request+0x24/0x1a0
Backtrace:
[<
00000000402d83f0>] submit_bio+0x70/0x140
[<
0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod]
[<
0000000011087f44>] sync_io+0xb4/0x190 [dm_mod]
[<
00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod]
[<
00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot]
[<
00000000401591d8>] process_one_work+0x160/0x460
[<
0000000040159bc0>] worker_thread+0x300/0x478
[<
0000000040161a70>] kthread+0x118/0x128
[<
0000000040104020>] end_fault_vector+0x20/0x28
[<
0000000040177220>] task_tick_fair+0x420/0x4d0
[<
00000000401aa048>] invoke_rcu_core+0x50/0x60
[<
00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8
[<
000000004014aaa0>] update_process_times+0xa8/0xc0
[<
00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598
[<
0000000040142408>] __do_softirq+0x250/0x2c0
[<
00000000401789d0>] find_busiest_group+0x3c0/0xc70
[ 119.379988]
Kernel panic - not syncing: Kernel Fault
Rebooting in 1 seconds..
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Vegard Nossum [Thu, 5 Sep 2013 11:00:14 +0000 (13:00 +0200)]
xen/blkback: fix reference counting
If the permission check fails, we drop a reference to the blkif without
having taken it in the first place. The bug was introduced in commit
604c499cbbcc3d5fe5fb8d53306aa0fae1990109 (xen/blkback: Check device
permissions before allowing OP_DISCARD).
Cc: stable@vger.kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Roger Pau Monne [Mon, 12 Aug 2013 10:53:43 +0000 (12:53 +0200)]
xen-blkfront: improve aproximation of required grants per request
Improve the calculation of required grants to process a request by
using nr_phys_segments instead of always assuming a request is going
to use all posible segments.
nr_phys_segments contains the number of scatter-gather DMA addr+len
pairs, which is basically what we put at every granted page.
for_each_sg iterates over the DMA addr+len pairs and uses a grant
page for each of them.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Roger Pau Monne [Mon, 12 Aug 2013 10:53:44 +0000 (12:53 +0200)]
xen-blkfront: revoke foreign access for grants not mapped by the backend
There's no need to keep the foreign access in a grant if it is not
persistently mapped by the backend. This allows us to free grants that
are not mapped by the backend, thus preventing blkfront from hoarding
all grants.
The main effect of this is that blkfront will only persistently map
the same grants as the backend, and it will always try to use grants
that are already mapped by the backend. Also the number of persistent
grants in blkfront is the same as in blkback (and is controlled by the
value in blkback).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Matt Wilson <msw@amazon.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Opdenacker [Sat, 12 Oct 2013 04:33:47 +0000 (06:33 +0200)]
mg_disk: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Duan Jiong [Wed, 6 Nov 2013 07:56:39 +0000 (15:56 +0800)]
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Duan Jiong [Wed, 6 Nov 2013 07:55:44 +0000 (15:55 +0800)]
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Geert Uytterhoeven [Mon, 4 Nov 2013 13:00:06 +0000 (14:00 +0100)]
block: Do not call sector_div() with a 64-bit divisor
do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
of 64-bit number by 32-bit numbers. Passing 64-bit divisor types caused
issues in the past on 32-bit platforms, cfr. commit
ea077b1b96e073eac5c3c5590529e964767fc5f7 ("m68k: Truncate base in
do_div()").
As queue_limits.max_discard_sectors and .discard_granularity are unsigned
int, max_discard_sectors and granularity should be unsigned int.
As bdev_discard_alignment() returns int, alignment should be int.
Now 2 calls to sector_div() can be replaced by 32-bit arithmetic:
- The 64-bit modulo operation can become a 32-bit modulo operation,
- The 64-bit division and multiplication can be replaced by a 32-bit
modulo operation and a subtraction.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chen Gang [Sun, 3 Nov 2013 14:23:39 +0000 (22:23 +0800)]
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
do_blk_trace_setup() will fully initialize 'buts.name', so can remove
the related memcpy(). And also use BLKTRACE_BDEV_SIZE and ARRAY_SIZE
instead of hard code number '32'.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kent Overstreet [Wed, 7 Aug 2013 18:14:32 +0000 (11:14 -0700)]
block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kent Overstreet [Wed, 7 Aug 2013 21:20:17 +0000 (14:20 -0700)]
block: Use rw_copy_check_uvector()
No need for silly open coding - and struct sg_iovec has exactly the same
layout as struct iovec...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Alireza Haghdoost [Wed, 23 Oct 2013 16:08:16 +0000 (17:08 +0100)]
block: Enable sysfs nomerge control for I/O requests in the plug list
This patch enables the sysfs to control I/O request merge
functionality in the plug list. While this control has been
implemented for the request queue, it was dismissed in the plug list.
Therefore, block layer merges requests together (or attempt to merge)
even if the merge capability was disable using sysfs nomerge parameter
value 2.
This limitation is directly affects functionality of io_submit()
system call. The system call enables user to submit a bunch of IO
requests from user space using struct iocb **ios input argument.
However, the unconditioned merging functionality in the plug list
potentially merges these requests together down the road. Therefore,
there is no way to distinguish between an application sending bunch of
sequential IOs and an application sending one big IO. Ultimately, all
requests generated by the former app merge within the plug list
together and looks similar to the second app.
While the merging functionality is a desirable feature to improve the
performance of IO subsystem for some applications, it is not useful
for other application like ours at all.
Signed-off-by: Alireza Haghdoost <alireza@cs.umn.edu>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Coding style modified.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mike Snitzer [Fri, 18 Oct 2013 15:44:49 +0000 (09:44 -0600)]
block: properly stack underlying max_segment_size to DM device
Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
(65536) even if the underlying device(s) have a larger value -- this is
due to blk_stack_limits() using min_not_zero() when stacking the
max_segment_size limit.
1073741824
before patch:
65536
after patch:
1073741824
Reported-by: Lukasz Flis <l.flis@cyfronet.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tomoki Sekiyama [Tue, 15 Oct 2013 22:42:19 +0000 (16:42 -0600)]
elevator: acquire q->sysfs_lock in elevator_change()
Add locking of q->sysfs_lock into elevator_change() (an exported function)
to ensure it is held to protect q->elevator from elevator_init(), even if
elevator_change() is called from non-sysfs paths.
sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
version, as the lock is already taken by elv_iosched_store().
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tomoki Sekiyama [Tue, 15 Oct 2013 22:42:16 +0000 (16:42 -0600)]
elevator: Fix a race in elevator switching and md device initialization
The soft lockup below happens at the boot time of the system using dm
multipath and the udev rules to switch scheduler.
[ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
[ 356.127001] RIP: 0010:[<
ffffffff81072a7d>] [<
ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
...
[ 356.127001] Call Trace:
[ 356.127001] [<
ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
[ 356.127001] [<
ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
[ 356.127001] [<
ffffffff810738b2>] del_timer_sync+0x52/0x60
[ 356.127001] [<
ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
[ 356.127001] [<
ffffffff812c98df>] elevator_exit+0x2f/0x50
[ 356.127001] [<
ffffffff812c9f21>] elevator_change+0xf1/0x1c0
[ 356.127001] [<
ffffffff812caa50>] elv_iosched_store+0x20/0x50
[ 356.127001] [<
ffffffff812d1d09>] queue_attr_store+0x59/0xb0
[ 356.127001] [<
ffffffff812143f6>] sysfs_write_file+0xc6/0x140
[ 356.127001] [<
ffffffff811a326d>] vfs_write+0xbd/0x1e0
[ 356.127001] [<
ffffffff811a3ca9>] SyS_write+0x49/0xa0
[ 356.127001] [<
ffffffff8164e899>] system_call_fastpath+0x16/0x1b
This is caused by a race between md device initialization by multipathd and
shell script to switch the scheduler using sysfs.
- multipathd:
SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
-> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
q->elevator = elevator_alloc(q, e); // not yet initialized
- sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
elevator_switch (in the call trace above)
struct elevator_queue *old = q->elevator;
q->elevator = elevator_alloc(q, new_e);
elevator_exit(old); // lockup! (*)
- multipathd: (cont.)
err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified
(*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
while timer->base == NULL. In this case, as timer will never initialized,
it results in lockup.
This patch introduces acquisition of q->sysfs_lock around elevator_init()
into blk_init_allocated_queue(), to provide mutual exclusion between
initialization of the q->scheduler and switching of the scheduler.
This should fix this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=902012
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Lameter [Tue, 15 Oct 2013 18:22:29 +0000 (12:22 -0600)]
block: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
this_cpu_inc(y)
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mikulas Patocka [Mon, 14 Oct 2013 16:14:13 +0000 (12:14 -0400)]
bdi: test bdi_init failure
There were two places where return value from bdi_init was not tested.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mikulas Patocka [Mon, 14 Oct 2013 16:13:24 +0000 (12:13 -0400)]
block: fix a probe argument to blk_register_region
The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;
However, in loop and brd, it returns negative error from ERR_PTR.
This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:
BUG: unable to handle kernel NULL pointer dereference at
00000000000002b4
IP: [<
ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD
23c677067 PUD
23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P W O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
task:
ffff880203cc6bc0 ti:
ffff88023e47c000 task.ti:
ffff88023e47c000
RIP: 0010:[<
ffffffff8118b188>] [<
ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:
ffff88023e47dbd8 EFLAGS:
00010286
RAX:
ffffffffffffff74 RBX:
ffffffffffffff74 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
0000000000000000 RDI:
0000000000000001
RBP:
ffff88023e47dc18 R08:
0000000000000002 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff88023f519658
R13:
ffffffff8118c300 R14:
0000000000000000 R15:
ffff88023f519640
FS:
00007f2070bf7700(0000) GS:
ffff880247400000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000000002b4 CR3:
000000023da1d000 CR4:
00000000000007e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Stack:
0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
[<
ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<
ffffffff8118c10d>] blkdev_get+0x1dd/0x370
[<
ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<
ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
[<
ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
[<
ffffffff8118c365>] blkdev_open+0x65/0x80
[<
ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
[<
ffffffff8114d214>] finish_open+0x34/0x50
[<
ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
[<
ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
[<
ffffffff81115a8e>] ? might_fault+0x4e/0xa0
[<
ffffffff8115f4f0>] do_filp_open+0x40/0x90
[<
ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
[<
ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
[<
ffffffff8114e45f>] do_sys_open+0xef/0x1d0
[<
ffffffff8114e559>] SyS_open+0x19/0x20
[<
ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP [<
ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP <
ffff88023e47dbd8>
CR2:
00000000000002b4
---[ end trace
bb7f32dbf02398dc ]---
The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org # 2.6.22+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mikulas Patocka [Mon, 14 Oct 2013 16:12:24 +0000 (12:12 -0400)]
loop: fix crash if blk_alloc_queue fails
loop: fix crash if blk_alloc_queue fails
If blk_alloc_queue fails, loop_add cleans up, but it doesn't clean up the
identifier allocated with idr_alloc. That causes crash on module unload in
idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to
remove non-existed device with that id.
BUG: unable to handle kernel NULL pointer dereference at
0000000000000380
IP: [<
ffffffff812057c9>] del_gendisk+0x19/0x2d0
PGD
43d399067 PUD
43d0ad067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor but!
ton unix
CPU: 7 PID: 2735 Comm: rmmod Tainted: G W 3.10.15-devel #15
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
task:
ffff88043d38e780 ti:
ffff88043d21e000 task.ti:
ffff88043d21e000
RIP: 0010:[<
ffffffff812057c9>] [<
ffffffff812057c9>] del_gendisk+0x19/0x2d0
RSP: 0018:
ffff88043d21fe10 EFLAGS:
00010282
RAX:
ffffffffa05102e0 RBX:
0000000000000000 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
ffff88043ea82800 RDI:
0000000000000000
RBP:
ffff88043d21fe48 R08:
0000000000000000 R09:
0000000000000001
R10:
0000000000000001 R11:
0000000000000000 R12:
00000000000000ff
R13:
0000000000000080 R14:
0000000000000000 R15:
ffff88043ea82800
FS:
00007ff646534700(0000) GS:
ffff880447000000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
0000000000000380 CR3:
000000043e9bf000 CR4:
00000000000007e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Stack:
ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800
00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60
ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec
Call Trace:
[<
ffffffff8100aba4>] ? native_sched_clock+0x24/0x80
[<
ffffffffa05102b4>] loop_remove+0x14/0x40 [loop]
[<
ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop]
[<
ffffffff81217b74>] idr_for_each+0x104/0x190
[<
ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop]
[<
ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0
[<
ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop]
[<
ffffffff810a98ea>] SyS_delete_module+0x13a/0x260
[<
ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<
ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00
00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20
RIP [<
ffffffff812057c9>] del_gendisk+0x19/0x2d0
RSP <
ffff88043d21fe10>
CR2:
0000000000000380
---[ end trace
64ec069ec70f1309 ]---
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org # 3.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mikulas Patocka [Mon, 14 Oct 2013 16:11:36 +0000 (12:11 -0400)]
blk-core: Fix memory corruption if blkcg_init_queue fails
If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
to clean up structures allocated by the backing dev.
------------[ cut here ]------------
WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
ODEBUG: free active (active state 0) object type: percpu_counter hint: (null)
Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
CPU: 0 PID: 2739 Comm: lvchange Tainted: G W
3.10.15-devel #14
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
Call Trace:
[<
ffffffff813c8fd4>] dump_stack+0x19/0x1b
[<
ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
[<
ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
[<
ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
[<
ffffffff81229a15>] debug_print_object+0x85/0xa0
[<
ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
[<
ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
[<
ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
[<
ffffffff811f672e>] blk_alloc_queue+0xe/0x10
[<
ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
[<
ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[<
ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
[<
ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[<
ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[<
ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
[<
ffffffff81097662>] ? get_lock_stats+0x22/0x70
[<
ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[<
ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
[<
ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
[<
ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
[<
ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
---[ end trace
4b5ff0d55673d986 ]---
------------[ cut here ]------------
This fix should be backported to stable kernels starting with 2.6.37. Note
that in the kernels prior to 3.5 the affected code is different, but the
bug is still there - bdi_init is called and bdi_destroy isn't.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org # 2.6.37+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jeff Moyer [Tue, 8 Oct 2013 18:36:41 +0000 (14:36 -0400)]
block: fix race between request completion and timeout handling
crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
RIP: 0010:[<
ffffffff8124e424>] [<
ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP: 0018:
ffff881057eefd60 EFLAGS:
00010012
RAX:
ffff881d99e3e8a8 RBX:
ffff881d99e3e780 RCX:
ffff881d99e3e8a8
RDX:
ffff881d99e3e8a8 RSI:
ffff881d99e3e780 RDI:
ffff881d99e3e780
RBP:
ffff881057eefd80 R08:
ffff881057eefe90 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff881057f92338
R13:
0000000000000000 R14:
ffff881057f92338 R15:
ffff883058188000
FS:
0000000000000000(0000) GS:
ffff880040200000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0:
000000008005003b
CR2:
00000000006d3ec0 CR3:
000000302cd7d000 CR4:
00000000000406b0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000ffff0ff0 DR7:
0000000000000400
Process scsi_eh_0 (pid: 491, threadinfo
ffff881057eee000, task
ffff881057e29540)
Stack:
0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
<0>
ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
<0>
ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
Call Trace:
[<
ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
[<
ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
[<
ffffffff81362a23>] scsi_queue_insert+0x13/0x20
[<
ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
[<
ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
[<
ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
[<
ffffffff810908c6>] kthread+0x96/0xa0
[<
ffffffff8100c14a>] child_rip+0xa/0x20
[<
ffffffff81090830>] ? kthread+0x0/0xa0
[<
ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP [<
ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP <
ffff881057eefd60>
The RIP is this line:
BUG_ON(blk_queued_rq(rq));
After digging through the code, I think there may be a race between the
request completion and the timer handler running.
A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer). If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:
static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
and then call blk_rq_timed_out. The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.
Now, if the request happens to complete while this is going on, what
happens? Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared. So, from the above
paragraph, after the call to blk_clear_rq_complete. If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores). Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case). Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless. The request will be removed from the timeout list,
req_atom_complete will be set, and all will be well.
This will only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it. That is possible, but it may make
sense to keep digging for another race. I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used. It looks like we actually do run
into that situation in other reports.
This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request). It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race. I've boot tested this patch, but nothing more.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Tue, 17 Sep 2013 20:30:31 +0000 (22:30 +0200)]
blktrace: Send BLK_TN_PROCESS events to all running traces
Currently each task sends BLK_TN_PROCESS event to the first traced
device it interacts with after a new trace is started. When there are
several traced devices and the task accesses more devices, this logic
can result in BLK_TN_PROCESS being sent several times to some devices
while it is never sent to other devices. Thus blkparse doesn't display
command name when parsing some blktrace files.
Fix the problem by sending BLK_TN_PROCESS event to all traced devices
when a task interacts with any of them.
Signed-off-by: Jan Kara <jack@suse.cz>
Review-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 3 Nov 2013 23:41:51 +0000 (15:41 -0800)]
Linux 3.12
Linus Torvalds [Sun, 3 Nov 2013 19:36:41 +0000 (11:36 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Three fixes across arch/mips with the most complex one being the GIC
interrupt fix - at nine lines still not monster. I'm confident this
are the final MIPS patches even if there should go for an rc8"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: ralink: fix return value check in rt_timer_probe()
MIPS: malta: Fix GIC interrupt offsets
MIPS: Perf: Fix 74K cache map
Mathias Krause [Sun, 3 Nov 2013 11:36:28 +0000 (12:36 +0100)]
ipc, msg: forbid negative values for "msg{max,mnb,mni}"
Negative message lengths make no sense -- so don't do negative queue
lenghts or identifier counts. Prevent them from getting negative.
Also change the underlying data types to be unsigned to avoid hairy
surprises with sign extensions in cases where those variables get
evaluated in unsigned expressions with bigger data types, e.g size_t.
In case a user still wants to have "unlimited" sizes she could just use
INT_MAX instead.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 2 Nov 2013 17:27:29 +0000 (10:27 -0700)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/rusty/linux
Pull ARM kallsyms fix from Rusty Russell:
"Last minute perf unbreakage for ARM modules; spent a day in
linux-next"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
scripts/kallsyms: filter symbols not in kernel address space
Vineet Gupta [Sat, 2 Nov 2013 12:17:49 +0000 (17:47 +0530)]
ARC: Incorrect mm reference used in vmalloc fault handler
A vmalloc fault needs to sync up PGD/PTE entry from init_mm to current
task's "active_mm". ARC vmalloc fault handler however was using mm.
A vmalloc fault for non user task context (actually pre-userland, from
init thread's open for /dev/console) caused the handler to deref NULL mm
(for mm->pgd)
The reasons it worked so far is amazing:
1. By default (!SMP), vmalloc fault handler uses a cached value of PGD.
In SMP that MMU register is repurposed hence need for mm pointer deref.
2. In pre-3.12 SMP kernel, the problem triggering vmalloc didn't exist in
pre-userland code path - it was introduced with commit
20bafb3d23d108bc
"n_tty: Move buffers into n_tty_data"
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: stable@vger.kernel.org #3.10 and 3.11
Cc: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>