platform/kernel/linux-rpi.git
10 years agoMerge branch 'for-3.16/core' into for-3.16/drivers
Jens Axboe [Fri, 30 May 2014 14:11:50 +0000 (08:11 -0600)]
Merge branch 'for-3.16/core' into for-3.16/drivers

Pulled in for the blk_mq_tag_to_rq() change, which impacts
mtip32xx.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: blk_mq_tag_to_rq should handle flush request
Shaohua Li [Fri, 30 May 2014 14:06:42 +0000 (08:06 -0600)]
blk-mq: blk_mq_tag_to_rq should handle flush request

A flush request is special: it borrows the tag from its parent
request. Hence blk_mq_tag_to_rq() needs special handling to return
the flush request for that tag.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
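
The lookup rule described in this commit can be illustrated with a small
user-space model. This is a Python sketch, not the kernel code: the
Request and TagSet classes and their fields are hypothetical names chosen
for illustration, and it assumes at most one flush request is in flight.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    name: str
    tag: int = -1
    is_flush: bool = False


@dataclass
class TagSet:
    # rqs[tag] holds the request that owns each tag (hypothetical layout).
    rqs: List[Optional[Request]]
    # The in-flight flush request, if any; it borrows its parent's tag.
    flush_rq: Optional[Request] = None

    def tag_to_rq(self, tag: int) -> Optional[Request]:
        # If a flush is in flight and has borrowed this tag, return the
        # flush request instead of the parent stored in the table.
        if self.flush_rq is not None and self.flush_rq.tag == tag:
            return self.flush_rq
        return self.rqs[tag]
```

Without the special case, a driver that resolves a completed tag would
get the parent request back rather than the flush that actually used it.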
10 years agoblock: remove dead code in scsi_ioctl:blk_verify_command
Dave Jones [Thu, 29 May 2014 19:11:30 +0000 (15:11 -0400)]
block: remove dead code in scsi_ioctl:blk_verify_command

filter gets assigned the address of blk_default_cmd_filter on
entry to this function, so the !filter condition can never be true.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: request initialization optimizations
Jens Axboe [Thu, 29 May 2014 17:00:11 +0000 (11:00 -0600)]
blk-mq: request initialization optimizations

We currently clear a lot more than we need to, so make that a bit
more clever. Make some of the init dependent on features, like
only setting start_time if we are going to use it.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: add queue flag for disabling SG merging
Jens Axboe [Thu, 29 May 2014 15:53:32 +0000 (09:53 -0600)]
block: add queue flag for disabling SG merging

If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.

Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.

Signed-off-by: Jens Axboe <axboe@fb.com>
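
The trade-off behind the new flag can be sketched as follows. This is a
simplified Python model, not the kernel implementation: a bio vector is
assumed to be an (address, length) pair, and the only merge criterion
modelled is physical contiguity (the real code also honours segment size
and boundary limits).

```python
from typing import List, Tuple

# A bio vector is modelled as (start_address, length); hypothetical layout.
BioVec = Tuple[int, int]


def count_segments(vecs: List[BioVec], no_sg_merge: bool) -> int:
    """Return the number of segments a bio would occupy."""
    if no_sg_merge:
        # Fast path: one segment per vector, no scan at all.
        return len(vecs)
    segments = 0
    prev_end = None
    for start, length in vecs:
        # A vector that starts where the previous one ended is collapsed
        # into the same segment.
        if prev_end is None or start != prev_end:
            segments += 1
        prev_end = start + length
    return segments
```

With the flag set the result may be larger than strictly necessary, which
is acceptable for devices that have plenty of SG entries.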
10 years agoblock: remove 'magic' from struct blk_plug
Jens Axboe [Thu, 29 May 2014 14:09:00 +0000 (08:09 -0600)]
block: remove 'magic' from struct blk_plug

I don't think we've ever caught any bugs with this, and there's the
list poisoning for the plug lists to catch uninitialized cases.
So remove the magic member and save 8 bytes in the struct.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoMerge branch 'stable/for-jens-3.16' of git://git.kernel.org/pub/scm/linux/kernel...
Jens Axboe [Wed, 28 May 2014 18:37:04 +0000 (12:37 -0600)]
Merge branch 'stable/for-jens-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-3.16/drivers

Konrad writes:

Please git pull the following branch:

git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.16

which has a bunch of fixes to the Xen block frontend and backend drivers
and a new parameter for the Xen backend driver: an override (set by the
toolstack) that controls whether discard support is exposed (provided the
disk supports it, of course).

10 years agoxen-blkback: defer freeing blkif to avoid blocking xenwatch
Valentin Priescu [Tue, 20 May 2014 20:28:50 +0000 (22:28 +0200)]
xen-blkback: defer freeing blkif to avoid blocking xenwatch

Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
requests to finish. If the VBD is attached to a hot-swappable disk, then
xenwatch can hang for a long period of time, stalling other watches.

 INFO: task xenwatch:39 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
 Call Trace:
 [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
 [<ffffffff8128f761>] ? list_del+0x11/0x40
 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160
 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
 [<ffffffff8147a26a>] schedule+0x3a/0x60
 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
 [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
 [<ffffffff81346488>] device_release_driver+0x28/0x40
 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0
 [<ffffffff81342c9f>] device_del+0x12f/0x1a0
 [<ffffffff81342d2d>] device_unregister+0x1d/0x60
 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
 [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
 [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10
 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
 [<ffffffff810799d6>] kthread+0x96/0xa0
 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10
 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
 [<ffffffff81485930>] ? gs_change+0x13/0x13

With this patch, when there is still pending I/O, the actual disconnect
is done by the last reference holder (last pending I/O request). In this
case, xenwatch doesn't block indefinitely.

Signed-off-by: Valentin Priescu <priescuv@amazon.com>
Reviewed-by: Steven Kady <stevkady@amazon.com>
Reviewed-by: Steven Noonan <snoonan@amazon.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
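
The "last reference holder does the teardown" idea from this commit can
be modelled in a few lines. This is a single-threaded Python sketch under
stated assumptions: the Blkif class and its methods are hypothetical
names, and the atomic operations a real driver needs are omitted.

```python
class Blkif:
    """Toy model of a backend interface with reference-counted teardown."""

    def __init__(self) -> None:
        self.refcnt = 1          # reference held by the owner
        self.freed = False

    def get(self) -> None:
        self.refcnt += 1         # taken when an I/O request is issued

    def put(self) -> None:
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True    # last holder performs the actual teardown

    def remove(self) -> None:
        # Drop the owner's reference and return immediately instead of
        # waiting for pending I/O to drain.
        self.put()
```

In this model remove() never waits: if I/O is still pending, the final
put() issued by the last completing request frees the object.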
10 years agoxen blkif.h: fix comment typo in discard-alignment
Olaf Hering [Wed, 21 May 2014 14:32:41 +0000 (16:32 +0200)]
xen blkif.h: fix comment typo in discard-alignment

Add the missing 'n' to discard-alignment

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
10 years agoxen/blkback: disable discard feature if requested by toolstack
Olaf Hering [Wed, 21 May 2014 14:32:42 +0000 (16:32 +0200)]
xen/blkback: disable discard feature if requested by toolstack

Newer toolstacks may provide a boolean property "discard-enable" in the
backend node. Its purpose is to disable discard for file backed storage
to avoid fragmentation. Recognize this setting also for physical
storage.  If that property exists and is false, do not advertise
"feature-discard" to the frontend.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
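
The decision rule in this commit can be written down as a small
predicate. This is a Python sketch rather than the driver code: the
backend node is assumed to be a string-to-string mapping, and the boolean
is assumed to be stored as "0" or "1".

```python
from typing import Mapping


def advertise_discard(backend_node: Mapping[str, str],
                      device_supports_discard: bool) -> bool:
    """Decide whether "feature-discard" should be exposed to the frontend."""
    if not device_supports_discard:
        return False
    # An absent "discard-enable" property leaves discard enabled.
    value = backend_node.get("discard-enable")
    if value is None:
        return True
    # Property present: only an explicit false ("0") disables discard.
    return value != "0"
```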
10 years agoxen-blkfront: remove type check from blkfront_setup_discard
Olaf Hering [Wed, 21 May 2014 14:32:40 +0000 (16:32 +0200)]
xen-blkfront: remove type check from blkfront_setup_discard

In its initial implementation a check for "type" was added, but only phy
and file are handled. This breaks advertised discard support for other
type values such as qdisk.

Fix and simplify this function: If the backend advertises discard
support it is supposed to implement it properly, so enable
feature_discard unconditionally. If the backend advertises the need for
a certain granularity and alignment then propagate both properties to
the blocklayer. The discard-secure property is a boolean, update the code
to reflect that.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
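
The simplified frontend logic described above might look roughly like
this. It is a Python sketch under assumptions: the backend properties are
modelled as a mapping of integers, the key "discard-granularity" and the
result field names are illustrative, and propagating granularity and
alignment only when both are present is one reading of the commit text.

```python
from typing import Dict, Mapping


def setup_discard(backend: Mapping[str, int]) -> Dict[str, object]:
    settings: Dict[str, object] = {"feature_discard": False}
    if not backend.get("feature-discard", 0):
        return settings
    # No check on the backend "type": any backend that advertises
    # discard is trusted to implement it.
    settings["feature_discard"] = True
    # Granularity and alignment are propagated together.
    if "discard-granularity" in backend and "discard-alignment" in backend:
        settings["discard_granularity"] = backend["discard-granularity"]
        settings["discard_alignment"] = backend["discard-alignment"]
    # discard-secure is treated as a plain boolean.
    settings["feature_secdiscard"] = bool(backend.get("discard-secure", 0))
    return settings
```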
10 years agoMerge branch 'for-3.16/core' into for-3.16/drivers
Jens Axboe [Wed, 28 May 2014 16:18:51 +0000 (10:18 -0600)]
Merge branch 'for-3.16/core' into for-3.16/drivers

Pull in core changes (again), since we got rid of the alloc/free
hctx mq_ops hooks and mtip32xx then needed updating again.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: remove alloc_hctx and free_hctx methods
Christoph Hellwig [Wed, 28 May 2014 16:11:06 +0000 (18:11 +0200)]
blk-mq: remove alloc_hctx and free_hctx methods

There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: add file comments and update copyright notices
Jens Axboe [Wed, 28 May 2014 16:15:41 +0000 (10:15 -0600)]
blk-mq: add file comments and update copyright notices

None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoMerge branch 'for-3.16/core' into for-3.16/drivers
Jens Axboe [Wed, 28 May 2014 15:50:26 +0000 (09:50 -0600)]
Merge branch 'for-3.16/core' into for-3.16/drivers

mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the
core changes so we have a properly merged end result.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: remove blk_mq_alloc_request_pinned
Christoph Hellwig [Tue, 27 May 2014 18:59:50 +0000 (20:59 +0200)]
blk-mq: remove blk_mq_alloc_request_pinned

We now only have one caller left and can open code it there in a cleaner
way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
Christoph Hellwig [Tue, 27 May 2014 18:59:49 +0000 (20:59 +0200)]
blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request

We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it.  Just call __blk_mq_alloc_request to wait directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: remove blk_mq_wait_for_tags
Christoph Hellwig [Tue, 27 May 2014 18:59:48 +0000 (20:59 +0200)]
blk-mq: remove blk_mq_wait_for_tags

The current logic for blocking tag allocation is rather confusing, as we
first allocate and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.

Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: initialize request in __blk_mq_alloc_request
Christoph Hellwig [Tue, 27 May 2014 18:59:47 +0000 (20:59 +0200)]
blk-mq: initialize request in __blk_mq_alloc_request

Both callers of __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
Christoph Hellwig [Tue, 27 May 2014 18:59:46 +0000 (20:59 +0200)]
blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request

Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: add helper to insert requests from irq context
Christoph Hellwig [Wed, 28 May 2014 14:08:02 +0000 (08:08 -0600)]
blk-mq: add helper to insert requests from irq context

Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes.  Replace them with a per-request_queue list of requests to
be requeued.

Based on an earlier test by Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
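
The per-queue requeue list can be pictured with the following model. It
is a Python sketch, not the kernel helper: the RequestQueue class, its
method names and the at_head parameter are hypothetical, requests are
plain strings, and a threading.Lock stands in for an interrupt-safe
spinlock.

```python
import threading
from collections import deque
from typing import Deque, List


class RequestQueue:
    """Toy request queue with a single, queue-wide requeue list."""

    def __init__(self) -> None:
        self.requeue_lock = threading.Lock()
        self.requeue_list: Deque[str] = deque()
        self.dispatched: List[str] = []

    def add_to_requeue_list(self, rq: str, at_head: bool = False) -> None:
        # Cheap enough for interrupt context: only a short list insertion
        # under a lock, no per-request work item is touched.
        with self.requeue_lock:
            if at_head:
                self.requeue_list.appendleft(rq)
            else:
                self.requeue_list.append(rq)

    def run_requeue_work(self) -> None:
        # Runs later in process context: take the whole list, then
        # re-insert each request for dispatch.
        with self.requeue_lock:
            pending = list(self.requeue_list)
            self.requeue_list.clear()
        self.dispatched.extend(pending)
```

Because the list lives in the queue rather than in each request, nothing
in the request itself has to be shared with other per-request state.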
10 years agoblk-mq: remove stale comment for blk_mq_complete_request()
Jens Axboe [Wed, 28 May 2014 14:06:34 +0000 (08:06 -0600)]
blk-mq: remove stale comment for blk_mq_complete_request()

It works for both IPI and local completions as of commit
95f096849932.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agofloppy: do not corrupt bio.bi_flags when reading block 0
Jiri Kosina [Wed, 28 May 2014 09:55:23 +0000 (11:55 +0200)]
floppy: do not corrupt bio.bi_flags when reading block 0

Commit 41a55b4de39 ("floppy: silence warning during disk test") caused
bio.bi_flags to be overwritten, and its initialization to BIO_UPTODATE
in bio_init() to be lost.

This was unnoticed until 7b7b68bba5 ("floppy: bail out in open() if
drive is not responding to block0 read"), because the error value wasn't
checked for in the bio completion callback.

Now we are actually looking at the error, and the loss of BIO_UPTODATE
causes EIO to be wrongly passed to the callback, which confuses the
FD_OPEN_SHOULD_FAIL_BIT logic.

Fix this by not destroying the previous value of bi_flags when setting
BIO_QUIET.

Cc: Stephen Hemminger <shemminger@vyatta.com>
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
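
The class of bug fixed here (assigning a flag word instead of OR-ing a
bit into it) is easy to reproduce in isolation. The sketch below is a
Python model: the bit numbers are illustrative rather than the kernel's
values, and the exact form of the original assignment is an assumption.

```python
BIO_UPTODATE = 0   # bit numbers are illustrative only
BIO_QUIET = 1


def init_flags() -> int:
    # Mirrors the initialization that marks a fresh bio as up to date.
    return 1 << BIO_UPTODATE


def set_quiet_buggy(flags: int) -> int:
    # Plain assignment: every previously set bit is lost.
    return 1 << BIO_QUIET


def set_quiet_fixed(flags: int) -> int:
    # OR the bit in, preserving the existing flags.
    return flags | (1 << BIO_QUIET)


def is_uptodate(flags: int) -> bool:
    return bool(flags & (1 << BIO_UPTODATE))
```

With the buggy variant is_uptodate() turns false, which is what made the
completion callback report a spurious I/O error.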
10 years agoblk-mq: allow non-softirq completions
Jens Axboe [Tue, 27 May 2014 23:46:48 +0000 (17:46 -0600)]
blk-mq: allow non-softirq completions

Right now we export two ways of completing a request:

1) blk_mq_complete_request(). This uses an IPI (if needed) and
   completes through q->softirq_done_fn(). It also works with
   timeouts.

2) blk_mq_end_io(). This completes inline, and ignores any timeout
   state of the request.

Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: pass in suggested NUMA node to ->alloc_hctx()
Jens Axboe [Tue, 27 May 2014 18:06:53 +0000 (12:06 -0600)]
blk-mq: pass in suggested NUMA node to ->alloc_hctx()

Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.

So just pass in the suggested node directly to the alloc
function.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: only allocate/free mq_usage_counter in blk-mq
Ming Lei [Tue, 27 May 2014 15:35:14 +0000 (23:35 +0800)]
block: only allocate/free mq_usage_counter in blk-mq

The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: avoid code duplication
Ming Lei [Tue, 27 May 2014 15:35:13 +0000 (23:35 +0800)]
blk-mq: avoid code duplication

blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: fix leak of hctx->ctx_map
Ming Lei [Tue, 27 May 2014 14:34:45 +0000 (08:34 -0600)]
blk-mq: fix leak of hctx->ctx_map

hctx->ctx_map should have been freed inside blk_mq_free_queue().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock/blk-lib.c: make __blkdev_issue_zeroout static
Fabian Frederick [Mon, 26 May 2014 20:19:14 +0000 (22:19 +0200)]
block/blk-lib.c: make __blkdev_issue_zeroout static

__blkdev_issue_zeroout is only used in blk-lib.c

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: idle all hardware contexts before freeing a queue
Christoph Hellwig [Mon, 26 May 2014 09:45:02 +0000 (11:45 +0200)]
blk-mq: idle all hardware contexts before freeing a queue

Without this we can leak the active_queues reference if a command is
freed while it is considered active.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: allow setting of per-request timeouts
Jens Axboe [Fri, 23 May 2014 20:14:57 +0000 (14:14 -0600)]
blk-mq: allow setting of per-request timeouts

Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.

Signed-off-by: Jens Axboe <axboe@fb.com>
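
The selection rule amounts to a one-line fallback. A minimal Python
sketch, assuming that a timeout of 0 means "not set" and that both values
use the same time unit:

```python
def effective_timeout(request_timeout: int, queue_timeout: int) -> int:
    # A per-request timeout wins when the driver has set one;
    # otherwise fall back to the queue-wide default.
    return request_timeout if request_timeout else queue_timeout
```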
10 years agoblk-mq: export blk_mq_tag_busy_iter
Sam Bradshaw [Fri, 23 May 2014 19:30:16 +0000 (13:30 -0600)]
blk-mq: export blk_mq_tag_busy_iter

Export the blk-mq in-flight tag iterator for driver consumption.
This is particularly useful in exception paths or SRSI where
in-flight IOs need to be cancelled and/or reissued. The NVMe driver
conversion will use this.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: split make request handler for multi and single queue
Jens Axboe [Thu, 22 May 2014 16:40:51 +0000 (10:40 -0600)]
blk-mq: split make request handler for multi and single queue

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
  for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
  we want to go straight to hardware for SYNC IO.

Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: save memory by freeing requests on unused hardware queues
Jens Axboe [Wed, 21 May 2014 20:01:15 +0000 (14:01 -0600)]
blk-mq: save memory by freeing requests on unused hardware queues

Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.

Signed-off-by: Jens Axboe <axboe@fb.com>
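
To get a feel for how much memory is at stake, the situation can be
modelled as below. This is a Python sketch under loud assumptions: the
CPU-to-queue mapping (cpu % nr_hw_queues) is a stand-in for the real
topology-aware mapping, and the request size passed in is whatever the
caller chooses, not a measured value.

```python
from typing import List


def unused_hw_queues(nr_cpus: int, nr_hw_queues: int) -> List[int]:
    # Hypothetical mapping: CPU n submits to hardware queue n % nr_hw_queues.
    mapped = {cpu % nr_hw_queues for cpu in range(nr_cpus)}
    return [q for q in range(nr_hw_queues) if q not in mapped]


def reclaimable_bytes(nr_cpus: int, nr_hw_queues: int,
                      queue_depth: int, request_size: int) -> int:
    # Each unused queue holds queue_depth preallocated requests.
    unused = len(unused_hw_queues(nr_cpus, nr_hw_queues))
    return unused * queue_depth * request_size
```

For example, a device exposing 8 queues on a 4-CPU machine leaves 4
queues unmapped in this model, so half of the request maps can be freed.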
10 years agoblk-mq: allow the hctx cpu hotplug notifier to return errors
Jens Axboe [Wed, 21 May 2014 19:59:08 +0000 (13:59 -0600)]
blk-mq: allow the hctx cpu hotplug notifier to return errors

Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: Micro-optimize blk_queue_nomerges() check
Robert Elliott [Tue, 20 May 2014 21:46:26 +0000 (16:46 -0500)]
blk-mq: Micro-optimize blk_queue_nomerges() check

In blk_mq_make_request(), do the blk_queue_nomerges() check
outside the call to blk_attempt_plug_merge() to eliminate
function call overhead when nomerges=2 (disabled)

Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: initialize q->nr_requests after calling blk_queue_make_request()
Jens Axboe [Tue, 20 May 2014 21:17:27 +0000 (15:17 -0600)]
blk-mq: initialize q->nr_requests after calling blk_queue_make_request()

blk_queue_make_request() overwrites our set value for q->nr_requests,
turning it into the default of 128. Set this appropriately after
initializing queue values in blk_queue_make_request().

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agomtip32xx: move error handling to service thread
Asai Thambi S P [Tue, 20 May 2014 17:48:56 +0000 (10:48 -0700)]
mtip32xx: move error handling to service thread

Move error handling to service thread, and use mtip_set_timeout()
to set timeouts for HDIO_DRIVE_TASK and HDIO_DRIVE_CMD IOCTL commands.

Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: allow changing of queue depth through sysfs
Jens Axboe [Tue, 20 May 2014 17:49:02 +0000 (11:49 -0600)]
blk-mq: allow changing of queue depth through sysfs

For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also allows dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agohtmldocs: fix bio.c location
Jens Axboe [Tue, 20 May 2014 14:17:35 +0000 (08:17 -0600)]
htmldocs: fix bio.c location

Commit f9c78b2be2ca moved bio.c from fs/ to block/, but didn't
update the docbook location. Fix that up.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: move mm/bounce.c to block/
Jens Axboe [Tue, 20 May 2014 02:01:52 +0000 (20:01 -0600)]
block: move mm/bounce.c to block/

Continue moving some of the block files that are scattered around.
bounce.c contains only code for bouncing the contents of a bio.
It's block proper code, not mm code.

Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoMerge branch 'for-3.16/blk-mq-tagging' into for-3.16/core
Jens Axboe [Mon, 19 May 2014 17:52:35 +0000 (11:52 -0600)]
Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core

Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/blk-mq-tag.c

10 years agoblk-mq: switch ctx pending map to the sparser blk_align_bitmap
Jens Axboe [Mon, 19 May 2014 15:23:55 +0000 (09:23 -0600)]
blk-mq: switch ctx pending map to the sparser blk_align_bitmap

Each hardware queue has a bitmap of software queues with pending
requests. When new IO is queued on a software queue, the bit is
set, and when IO is pruned on a hardware queue run, the bit is
cleared. This causes a lot of traffic. Switch this from the regular
BITS_PER_LONG bitmap to a sparser layout, similarly to what was
done for blk-mq tagging.

A 20% performance increase was observed for single threaded IO, and
about a 15% performance increase for multiple threads driving the
same device.

Signed-off-by: Jens Axboe <axboe@fb.com>
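
The sparser layout can be pictured as a bitmap that deliberately uses
fewer bits per word, so that neighbouring software queues land in
different words. The Python sketch below models the indexing only; the
SparseBitmap class is a hypothetical name, and Python lists do not
reproduce the cache-line placement that gives the kernel its speedup.

```python
from typing import List


class SparseBitmap:
    def __init__(self, nbits: int, bits_per_word: int) -> None:
        self.bits_per_word = bits_per_word
        nwords = (nbits + bits_per_word - 1) // bits_per_word
        self.words: List[int] = [0] * nwords

    def set(self, bit: int) -> None:
        # Called when new IO is queued on software queue number `bit`.
        word, offset = divmod(bit, self.bits_per_word)
        self.words[word] |= 1 << offset

    def clear(self, bit: int) -> None:
        # Called when the hardware queue run has pruned that software queue.
        word, offset = divmod(bit, self.bits_per_word)
        self.words[word] &= ~(1 << offset)

    def test(self, bit: int) -> bool:
        word, offset = divmod(bit, self.bits_per_word)
        return bool(self.words[word] & (1 << offset))

    def pending(self) -> List[int]:
        # All software queues that currently have work, in index order.
        bpw = self.bits_per_word
        return [w * bpw + b
                for w, value in enumerate(self.words)
                for b in range(bpw)
                if value & (1 << b)]
```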
10 years agoblk-mq: move the cache friendly bitmap type out of blk-mq-tag
Jens Axboe [Mon, 19 May 2014 15:17:48 +0000 (09:17 -0600)]
blk-mq: move the cache friendly bitmap type out of blk-mq-tag

We will use it for the pending list in blk-mq core as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: move ioprio.c from fs/ to block/
Jens Axboe [Mon, 19 May 2014 17:02:18 +0000 (11:02 -0600)]
block: move ioprio.c from fs/ to block/

Like commit f9c78b2b, move this block related file outside
of fs/ and into the core block directory, block/.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: move bio.c and bio-integrity.c from fs/ to block/
Jens Axboe [Mon, 19 May 2014 14:16:41 +0000 (08:16 -0600)]
block: move bio.c and bio-integrity.c from fs/ to block/

They really belong in block/, especially now since it's not in
drivers/block/ anymore. Additionally, the get_maintainer script
gets it wrong when in fs/.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agovirtio_blk: fix race between start and stop queue
Ming Lei [Fri, 16 May 2014 15:31:21 +0000 (23:31 +0800)]
virtio_blk: fix race between start and stop queue

When there aren't enough vring descriptors left to add a request to
the vq, blk-mq is put into the stopped state until some of the
pending descriptors are completed and freed.

Unfortunately, the vq's interrupt may come just before
blk-mq's BLK_MQ_S_STOPPED flag is set, so blk-mq will
still be kept stopped even though lots of descriptors
are completed and freed in the interrupt handler. The worst
case is that all pending descriptors are freed in the
interrupt handler, and the queue is kept stopped forever.

This patch fixes the problem by starting/stopping blk-mq
while holding vq_lock.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agomtip32xx: stop block hardware queues before quiescing IO
Jens Axboe [Wed, 14 May 2014 14:22:56 +0000 (08:22 -0600)]
mtip32xx: stop block hardware queues before quiescing IO

We need to stop the block layer queues to prevent new "normal"
IO from entering the driver, while we wait for existing commands
to finish.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agomtip32xx: blk_mq_init_queue() returns an ERR_PTR
Dan Carpenter [Wed, 14 May 2014 12:54:18 +0000 (15:54 +0300)]
mtip32xx: blk_mq_init_queue() returns an ERR_PTR

We changed this from blk_alloc_queue_node() to blk_mq_init_queue() so
the check needs to be updated as well.

Fixes: ffc771b3ca8b2 ('mtip32xx: convert to use blk-mq')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agomtip32xx: convert to use blk-mq
Jens Axboe [Fri, 9 May 2014 15:42:02 +0000 (09:42 -0600)]
mtip32xx: convert to use blk-mq

This rips out timeout handling, requeueing, etc in converting
it to use blk-mq instead.

Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: improve support for shared tags maps
Jens Axboe [Tue, 13 May 2014 21:10:52 +0000 (15:10 -0600)]
blk-mq: improve support for shared tags maps

This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.

If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.

The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.

Signed-off-by: Jens Axboe <axboe@fb.com>
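
The throttling rule reduces to a fair-share comparison. The Python sketch
below is a model under assumptions: the function and parameter names are
hypothetical, the share is rounded up, and any minimum share the real
code may enforce is left out.

```python
def may_queue(in_flight: int, total_depth: int,
              active_queues: int, shared: bool) -> bool:
    """Return True if a hardware queue may allocate another tag."""
    if not shared or active_queues == 0:
        # Unshared tag sets are never throttled.
        return True
    # Fair share: total depth divided by active users, rounded up.
    fair_share = -(-total_depth // active_queues)
    return in_flight < fair_share
```

With a depth of 128 and 4 active queues, each queue is allowed up to 32
tags in flight; a queue that goes inactive raises the share of the rest.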
10 years agoMerge branch 'for-3.16/blk-mq-tagging' into for-3.16/core
Jens Axboe [Sat, 10 May 2014 21:44:42 +0000 (15:44 -0600)]
Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core

10 years agoblk-mq: bitmap tag: cleanup blk_mq_init_tags
Ming Lei [Sat, 10 May 2014 17:01:51 +0000 (01:01 +0800)]
blk-mq: bitmap tag: cleanup blk_mq_init_tags

Both nr_cache and nr_tags aren't needed for bitmap tag anymore.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: bitmap tag: select random tag between 0 and (depth - 1)
Ming Lei [Sat, 10 May 2014 21:43:14 +0000 (15:43 -0600)]
blk-mq: bitmap tag: select random tag between 0 and (depth - 1)

The tag should be selected at random between 0 and
(depth - 1) with probability 1/depth, instead of between 0 and
(depth - 2) with probability 1/(depth - 1).

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
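
The off-by-one is easiest to see by enumerating the possible results. In
this Python sketch the modulo reduction is an assumption about how the
random number is mapped onto the tag range, and the function names are
illustrative.

```python
def pick_tag_buggy(rand: int, depth: int) -> int:
    # Reduces into [0, depth - 2]: the last tag is never chosen.
    return rand % (depth - 1)


def pick_tag_fixed(rand: int, depth: int) -> int:
    # Reduces into [0, depth - 1]: every tag is a possible start point.
    return rand % depth
```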
10 years agoblk-mq: bitmap tag: remove barrier in bt_clear_tag()
Ming Lei [Sat, 10 May 2014 17:01:49 +0000 (01:01 +0800)]
blk-mq: bitmap tag: remove barrier in bt_clear_tag()

The barrier isn't necessary because both atomic_dec_and_test()
and wake_up() imply a barrier.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag()
Ming Lei [Sat, 10 May 2014 17:01:48 +0000 (01:01 +0800)]
blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag()

The unlock memory barrier is needed to order accesses to req in the
free path against clearing the tag bit; otherwise either the request
free path may see an allocated request, or an initialized request in
the allocate path might be modified by the ongoing free path.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: only calculate part_in_flight() once
Jens Axboe [Fri, 9 May 2014 21:48:23 +0000 (15:48 -0600)]
block: only calculate part_in_flight() once

We first check if we have inflight IO, then retrieve that
same number again. Usually this isn't that costly since the
chance of having the data dirtied in between is small, but
there's no reason for calling part_in_flight() twice.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: fix race in IO start accounting
Jens Axboe [Fri, 9 May 2014 20:54:08 +0000 (14:54 -0600)]
blk-mq: fix race in IO start accounting

Commit c6d600c6 opened up a small race where we could attempt to
account IO completion on a request, racing with IO start accounting.
Fix this up by ensuring that we've accounted for IO start before
inserting the request.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: use sparser tag layout for lower queue depth
Jens Axboe [Fri, 9 May 2014 19:41:15 +0000 (13:41 -0600)]
blk-mq: use sparser tag layout for lower queue depth

For best performance, spreading tags over multiple cachelines
makes the tagging more efficient on multicore systems. But since
we have 8 * sizeof(unsigned long) tags per cacheline, we don't
always get a nice spread.

Attempt to spread the tags over at least 4 cachelines, using fewer
bits per unsigned long if we have to. This improves
tagging performance in setups with 32-128 tags. For higher depths,
the spread is the same as before (BITS_PER_LONG tags per cacheline).

Signed-off-by: Jens Axboe <axboe@fb.com>
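
The sizing rule can be worked through numerically. The following Python
sketch is a model, not the kernel routine: it assumes a 64-bit long, one
word per cacheline, and that the tags per word are halved until at least
four words are in use.

```python
def tags_per_word(depth: int, bits_per_long: int = 64,
                  min_words: int = 4) -> int:
    tpw = bits_per_long
    if depth >= min_words:
        # Halve the tags per word until the depth covers min_words words.
        while tpw * min_words > depth:
            tpw //= 2
    return tpw


def nr_words(depth: int, bits_per_long: int = 64, min_words: int = 4) -> int:
    tpw = tags_per_word(depth, bits_per_long, min_words)
    return (depth + tpw - 1) // tpw
```

In this model a depth of 32 uses 8 tags per word across 4 words, a depth
of 128 uses 32 per word, and depths of 256 and above keep the full 64.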
10 years agoblk-mq: implement new and more efficient tagging scheme
Jens Axboe [Fri, 9 May 2014 15:36:49 +0000 (09:36 -0600)]
blk-mq: implement new and more efficient tagging scheme

blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result is that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.

This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:

1) We cache the last allocated (or freed) tag on a per blk-mq
   software context basis. This allows us to limit the space
   we have to search. The key element here is not caching it
   in the shared tag structure, otherwise we end up dirtying
   more shared cache lines on each allocate/free operation.

2) The tag space is split into cache line sized groups, and
   each context will start off randomly in that space. Even up
   to full utilization of the space, this divides the tag users
   efficiently into cache line groups, avoiding dirtying the same
   one both between allocators and between allocator and freeer.

This scheme shows drastically better behaviour, on small tag spaces
as well as on large ones. It has been tested extensively
to show better performance for all the cases blk-mq cares about.

Signed-off-by: Jens Axboe <axboe@fb.com>
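
The first of the two tricks (a per-context search hint) can be sketched
as a toy allocator. This Python model is an illustration under
assumptions: the TagAllocator class is a hypothetical name, a list of
booleans replaces the real bitmap, the cache-line grouping is not
modelled, and there is no locking or waiting.

```python
import random
from typing import List, Optional


class TagAllocator:
    def __init__(self, depth: int, nr_ctx: int, seed: int = 0) -> None:
        self.depth = depth
        self.used: List[bool] = [False] * depth
        rng = random.Random(seed)
        # Each software context starts at a random point in the tag space
        # and keeps its own hint, so the hint is never shared state.
        self.last_tag: List[int] = [rng.randrange(depth)
                                    for _ in range(nr_ctx)]

    def alloc(self, ctx: int) -> Optional[int]:
        start = self.last_tag[ctx]
        # Search from the cached position and wrap around, so the whole
        # space can still be exhausted.
        for i in range(self.depth):
            tag = (start + i) % self.depth
            if not self.used[tag]:
                self.used[tag] = True
                self.last_tag[ctx] = tag
                return tag
        return None

    def free(self, ctx: int, tag: int) -> None:
        self.used[tag] = False
        # Remember the freed tag: it is the cheapest one to reuse next.
        self.last_tag[ctx] = tag
```

Allocating until alloc() returns None hands out every tag exactly once,
and a tag freed by a context is the first one that context gets back.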
10 years agoblk-mq: initialize struct request fields individually
Christoph Hellwig [Tue, 6 May 2014 10:12:45 +0000 (12:12 +0200)]
blk-mq: initialize struct request fields individually

This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: update a hotplug comment for grammar
Jens Axboe [Thu, 8 May 2014 20:50:19 +0000 (14:50 -0600)]
blk-mq: update a hotplug comment for grammar

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: add basic round-robin of what CPU to queue workqueue work on
Jens Axboe [Wed, 7 May 2014 16:26:44 +0000 (10:26 -0600)]
blk-mq: add basic round-robin of what CPU to queue workqueue work on

Right now we just pick the first CPU in the mask, but that can
easily overload that one. Add some basic batching and round-robin
all the entries in the mask instead.
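
The batching plus round-robin idea can be sketched in user-space C as
follows. This is only an illustration under assumptions: a 64-bit
integer stands in for the CPU mask, and the names (hw_queue,
pick_work_cpu, WORK_BATCH) and the batch size of 8 are hypothetical,
not taken from the patch.

```c
#include <stdint.h>

#define WORK_BATCH 8  /* hypothetical: picks served by one CPU before moving on */

struct hw_queue {
	uint64_t cpumask;  /* bit n set: CPU n may run this queue's work (non-zero) */
	int next_cpu;      /* CPU handed out by the next pick */
	int batch_left;    /* picks left before advancing to the next CPU */
};

/* Return the next set bit after 'cpu', wrapping around; -1 if mask is empty. */
static int next_cpu_in_mask(uint64_t mask, int cpu)
{
	for (int i = 1; i <= 64; i++) {
		int cand = (cpu + i) % 64;

		if (mask & (UINT64_C(1) << cand))
			return cand;
	}
	return -1;
}

static void hw_queue_init(struct hw_queue *q, uint64_t mask)
{
	q->cpumask = mask;
	q->next_cpu = next_cpu_in_mask(mask, 63);  /* wraps to the lowest set bit */
	q->batch_left = WORK_BATCH;
}

/* Hand out the current CPU; after WORK_BATCH picks, rotate to the next one. */
static int pick_work_cpu(struct hw_queue *q)
{
	int cpu = q->next_cpu;

	if (--q->batch_left <= 0) {
		q->next_cpu = next_cpu_in_mask(q->cpumask, cpu);
		q->batch_left = WORK_BATCH;
	}
	return cpu;
}
```

With a mask containing CPUs 1 and 3, the first eight picks return 1,
the next eight return 3, and then the rotation starts over at 1.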

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove unnecessary prototype for cdrom_get_disc_info
Joe Perches [Mon, 5 May 2014 00:05:13 +0000 (17:05 -0700)]
cdrom: Remove unnecessary prototype for cdrom_get_disc_info

Move the function to the proper spot instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove unnecessary prototype for cdrom_mrw_exit
Joe Perches [Mon, 5 May 2014 00:05:12 +0000 (17:05 -0700)]
cdrom: Remove unnecessary prototype for cdrom_mrw_exit

Move the function to appropriate locations instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove cdrom_count_tracks prototype
Joe Perches [Mon, 5 May 2014 00:05:11 +0000 (17:05 -0700)]
cdrom: Remove cdrom_count_tracks prototype

Move function to proper location instead.
Fix whitespace and embedded if too.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove cdrom_get_next_writeable prototype
Joe Perches [Mon, 5 May 2014 00:05:10 +0000 (17:05 -0700)]
cdrom: Remove cdrom_get_next_writeable prototype

Move the function to the right spot instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove cdrom_get_last_written prototype
Joe Perches [Mon, 5 May 2014 00:05:09 +0000 (17:05 -0700)]
cdrom: Remove cdrom_get_last_written prototype

Move the function instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype
Joe Perches [Mon, 5 May 2014 00:05:08 +0000 (17:05 -0700)]
cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype

Neaten the spacing too.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove unnecessary sanitize_format prototype
Joe Perches [Mon, 5 May 2014 00:05:07 +0000 (17:05 -0700)]
cdrom: Remove unnecessary sanitize_format prototype

It's defined below without being called.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove unnecessary check_for_audio_disc prototype
Joe Perches [Mon, 5 May 2014 00:05:06 +0000 (17:05 -0700)]
cdrom: Remove unnecessary check_for_audio_disc prototype

The actual static is defined below it but not used until later.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove prototype for open_for_data
Joe Perches [Mon, 5 May 2014 00:05:05 +0000 (17:05 -0700)]
cdrom: Remove prototype for open_for_data

Move static function to the appropriate place to remove
the now unnecessary prototype.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove obfuscating IOCTL_IN and IOCTL_OUT macros
Joe Perches [Mon, 5 May 2014 00:05:04 +0000 (17:05 -0700)]
cdrom: Remove obfuscating IOCTL_IN and IOCTL_OUT macros

Macros with hidden control flow aren't nice.
Just use copy_to/from_user directly instead.
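
To show why hidden control flow hurts readability, here is a small
user-space sketch. It does not reproduce the removed macros:
stub_copy_from_user() is a hypothetical stand-in for copy_from_user(),
and STUB_IOCTL_IN and the get_volume_* helpers are made up for the
illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for copy_from_user(): returns the
 * number of bytes that could NOT be copied (0 on success).
 */
static size_t stub_copy_from_user(void *dst, const void *src, size_t n)
{
	if (!src)
		return n;
	memcpy(dst, src, n);
	return 0;
}

/* Macro style: the early return is invisible at the call site. */
#define STUB_IOCTL_IN(arg, in)					\
	if (stub_copy_from_user(&(in), (arg), sizeof(in)))	\
		return -EFAULT;

static int get_volume_macro(const void *arg, int *out)
{
	int vol;

	STUB_IOCTL_IN(arg, vol)  /* may return from this function */
	*out = vol;
	return 0;
}

/* Explicit style: the error path is visible where it happens. */
static int get_volume_explicit(const void *arg, int *out)
{
	int vol;

	if (stub_copy_from_user(&vol, arg, sizeof(vol)))
		return -EFAULT;
	*out = vol;
	return 0;
}
```

Both helpers behave the same; only the second one makes it obvious to a
reader that the function can bail out at the copy.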

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: Remove unused CHECKAUDIO macro
Joe Perches [Mon, 5 May 2014 00:05:03 +0000 (17:05 -0700)]
cdrom: Remove unused CHECKAUDIO macro

It's unused, make it disappear.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agocdrom: convert cdinfo to cd_dbg
Joe Perches [Mon, 5 May 2014 00:05:02 +0000 (17:05 -0700)]
cdrom: convert cdinfo to cd_dbg

It's a debugging message, mark it so.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock/blk-throttle.c: fix return of 0/1 with return type bool
Fabian Frederick [Fri, 2 May 2014 16:28:17 +0000 (18:28 +0200)]
block/blk-throttle.c: fix return of 0/1 with return type bool

Fix 4 coccinelle warnings.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock/blk-iopoll.c: use iop instead of iopoll
Fabian Frederick [Fri, 2 May 2014 16:21:45 +0000 (18:21 +0200)]
block/blk-iopoll.c: use iop instead of iopoll

All blk_iopoll functions use iop for parent iopoll structure except
blk_iopoll_complete. This also fixes one kernel-doc warning.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblk-mq: remove extra requeue trace
Jens Axboe [Fri, 2 May 2014 17:24:48 +0000 (11:24 -0600)]
blk-mq: remove extra requeue trace

We already issue a blktrace requeue event in
__blk_mq_requeue_request(), don't do it from the original caller
as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: null_blk: fix use after free
Ming Lei [Thu, 1 May 2014 07:12:36 +0000 (15:12 +0800)]
block: null_blk: fix use after free

entry(cmd->ll_list) may belong to a new request once end_cmd()
returns; this patch fixes that bug.

Without the change, it is easy to trigger an oops when
running the null_blk (timer) test.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agoblock: Fix format string mismatch in cfq-iosched.c
Masanari Iida [Mon, 28 Apr 2014 03:38:34 +0000 (12:38 +0900)]
block: Fix format string mismatch in cfq-iosched.c

Fix format string mismatch in cfq_var_show()

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: use list_first_entry_or_null in first_peer_device/first_connection
Lars Ellenberg [Mon, 28 Apr 2014 16:43:35 +0000 (18:43 +0200)]
drbd: use list_first_entry_or_null in first_peer_device/first_connection

If there are no peer_devices or connections, I'd rather have NULL
than some "arbitrary" address pretending to point to a struct.

Helps to avoid hard to debug symptoms, in case we ever try to use
and dereference a drbd_connection or drbd_peer_device
where we in fact don't have any connection at all.
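
The difference can be sketched with a minimal user-space circular list.
This is an illustration only: the list helpers are simplified
re-implementations, and struct peer_device and first_peer_or_null() are
hypothetical stand-ins, not the drbd code.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical element type embedding the list node. */
struct peer_device {
	int id;
	struct list_head node;
};

/*
 * On an empty list head->next points back at the head itself. Converting
 * that to a struct peer_device pointer would yield an address that only
 * pretends to be an element, so return NULL in that case instead.
 */
static struct peer_device *first_peer_or_null(struct list_head *head)
{
	if (head->next == head)
		return NULL;
	return (struct peer_device *)((char *)head->next -
				      offsetof(struct peer_device, node));
}
```

A caller can then test the result for NULL rather than dereferencing a
bogus pointer and corrupting unrelated memory.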

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: Allow attaching of a newly created device to any backing device
Philipp Reisner [Mon, 28 Apr 2014 16:43:34 +0000 (18:43 +0200)]
drbd: Allow attaching of a newly created device to any backing device

A newly created device was never exposed before, i.e. it has an
exposed_data_uuid of 0. It is then valid to attach it to any current_uuid
of a backing device (including, of course, a newly created one (4)).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: Test cstate while holding req_lock
Philipp Reisner [Mon, 28 Apr 2014 16:43:33 +0000 (18:43 +0200)]
drbd: Test cstate while holding req_lock

In case a connection transitions into C_TIMEOUT within the timer
function (request_timer_fn()) we need to make sure that the receiver
thread (potentially running on a different CPU) sees the updated
cstate later on.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: use blk_set_stacking_limits()
Philipp Reisner [Mon, 28 Apr 2014 16:43:32 +0000 (18:43 +0200)]
drbd: use blk_set_stacking_limits()

...instead of directly assigning to q->limits.discard_zeroes_data

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: evaluate disk and network timeout on different requests
Lars Ellenberg [Mon, 28 Apr 2014 16:43:31 +0000 (18:43 +0200)]
drbd: evaluate disk and network timeout on different requests

Just because it is the oldest not yet completed request
does not make it the oldest request waiting for disk.
Or waiting for the peer.

And we completely missed already completed requests
that would still hold references to activity log extents,
waiting only for the barrier ack.

Find the two oldest not yet completely processed requests,
one that is still waiting for local completion,
and one that is still waiting for some response from the peer.
These may or may not be the same request object.

Then separately apply the network and disk timeouts, respectively.
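
A minimal sketch of that selection in user-space C, under stated
assumptions: requests are kept in an array ordered oldest first, time
is a plain counter, and the names (struct req, check_timeouts,
timeout_result) are hypothetical rather than taken from drbd.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: submission time plus what it is still waiting for. */
struct req {
	unsigned long start;
	bool local_pending;  /* still waiting for local disk completion */
	bool net_pending;    /* still waiting for some response from the peer */
};

struct timeout_result {
	bool disk_timed_out;
	bool net_timed_out;
};

/* 'reqs' must be ordered oldest first. */
static struct timeout_result check_timeouts(const struct req *reqs, size_t n,
					    unsigned long now,
					    unsigned long disk_timeout,
					    unsigned long net_timeout)
{
	const struct req *oldest_disk = NULL;
	const struct req *oldest_net = NULL;
	struct timeout_result r = { false, false };

	/* One pass: remember the oldest request of each kind. */
	for (size_t i = 0; i < n; i++) {
		if (!oldest_disk && reqs[i].local_pending)
			oldest_disk = &reqs[i];
		if (!oldest_net && reqs[i].net_pending)
			oldest_net = &reqs[i];
		if (oldest_disk && oldest_net)
			break;
	}

	/* Apply each timeout to its own request; they may be the same object. */
	if (oldest_disk && now - oldest_disk->start > disk_timeout)
		r.disk_timed_out = true;
	if (oldest_net && now - oldest_net->start > net_timeout)
		r.net_timed_out = true;
	return r;
}
```

In this sketch an old request that only waits for the peer no longer
triggers the disk timeout, and a younger request waiting for the disk
is judged by its own age.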

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: Fix a hole in the challange-response connection authentication
Philipp Reisner [Mon, 28 Apr 2014 16:43:30 +0000 (18:43 +0200)]
drbd: Fix a hole in the challange-response connection authentication

In the implementation as it was, the two peers sent each other
a challenge, and expected the challenge hashed with the shared
secret back.

An attacker could simply wait for the challenge of the peer, and
send the same challenge back. Then it waits for the response, and
sends the same response back.

Prevent this by not accepting a challenge from the peer that is
the same as the challenge sent to the peer.
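
The check itself is small. A minimal sketch in user-space C, assuming
fixed-length challenges; CHALLENGE_LEN = 64 and the function name
peer_challenge_acceptable() are hypothetical choices for this
illustration:

```c
#include <stdbool.h>
#include <string.h>

#define CHALLENGE_LEN 64  /* hypothetical challenge size in bytes */

/*
 * Refuse a peer challenge that is byte-for-byte the one we just sent.
 * Otherwise a reflecting attacker gets us to compute the very response
 * we are about to demand from it.
 */
static bool peer_challenge_acceptable(const unsigned char *my_challenge,
				      const unsigned char *peers_challenge)
{
	return memcmp(my_challenge, peers_challenge, CHALLENGE_LEN) != 0;
}
```

With this in place the reflection fails at the first step, before any
hash over the shared secret is computed for the attacker.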

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: always implicitly close last epoch when idle
Lars Ellenberg [Mon, 28 Apr 2014 16:43:29 +0000 (18:43 +0200)]
drbd: always implicitly close last epoch when idle

Once our sender thread needs to wait_for_work(),
and actually needs to schedule(), just before we do that,
we already check if it is useful to implicitly close the last epoch.

The condition was too strict: only implicitly close the epoch,
if there have been no new (write) requests at all.

The assumption was that if there were new requests, they would
always be communicated one way or another, and would send necessary
epoch separating barriers explicitly.

This is not always true, e.g. when becoming diskless,
or while explicitly starting a full resync.

The last communicated epoch could stay open for a long time,
locking down corresponding activity log extents.

It is safe to always implicitly send that last barrier, as soon as we
determine that there cannot be more requests in the last communicated
epoch, even if there have been (uncommunicated) new requests in new
epochs meanwhile.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: add back some fairness to AL transactions
Lars Ellenberg [Mon, 28 Apr 2014 16:43:28 +0000 (18:43 +0200)]
drbd: add back some fairness to AL transactions

When batching more updates to the activity log into single transactions,
we lost the ability for new requests to force themselves into the active
set: all preparation steps became non-blocking, and if all currently
hot extents keep busy, they could starve out new incoming requests
to cold extents for quite a while.

This can only happen if your IO backend accepts more IO operations per
average DRBD replication round trip time than you have al-extents
configured.

If we have incoming requests to cold extents,
at least do one blocking update per transaction.

In an artificial worst-case workload on SSD with an asynchronous 600 ms
replication link, with al-extents = 7 (the minimum we allow), and
concurrent full resync, without this patch, some write requests have
been observed to be starved for 40 seconds.
With this patch, the application observed a worst case latency of twice
the replication round trip time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: keep max-bio size during detach/attach on disconnected primary
Lars Ellenberg [Mon, 28 Apr 2014 16:43:27 +0000 (18:43 +0200)]
drbd: keep max-bio size during detach/attach on disconnected primary

We want to store in persistent meta data what the peer DRBD can handle,
which, due to spreading requests to multiple bios,
may be more than its backing device can handle.

Otherwise, if a disconnected Primary temporarily loses access to its local data
as well, we may accidentally shrink the max-bio setting, potentially causing
already assembled, but not yet processed, application bios to be spuriously
failed due to device limits.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: fix a race between start_resync and send_and_submit
Lars Ellenberg [Mon, 28 Apr 2014 16:43:26 +0000 (18:43 +0200)]
drbd: fix a race between start_resync and send_and_submit

In the drbd make request function, specifically in
drbd_send_and_submit(), we decide whether we want to send the actual
write request, or only a "set this block out of sync" information.

We do so based on the current connection state, while holding the req_lock.
The connection state is not supposed to change while holding the req_lock.

But in drbd_start_resync, we did change that state anyways,
while only holding the global_state_lock, which is enough to change
sync-after dependencies (paused vs active resync), but
not good enough to change the connection state.

Fix: in drbd_start_resync, first grab the req_lock to serialize with
drbd_send_and_submit(), before grabbing the global_state_lock
to be able to evaluate the sync-after dependencies.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: Enable QUEUE_FLAG_DISCARD only if the peer can recieve P_TRIM
Lars Ellenberg [Mon, 28 Apr 2014 16:43:25 +0000 (18:43 +0200)]
drbd: Enable QUEUE_FLAG_DISCARD only if the peer can recieve P_TRIM

Allow the use of REQ_DISCARD.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: prepare sending side for REQ_DISCARD
Lars Ellenberg [Mon, 28 Apr 2014 16:43:24 +0000 (18:43 +0200)]
drbd: prepare sending side for REQ_DISCARD

Note that I do NOT call __drbd_chk_io_error for failed REQ_DISCARD.
That may be wrong, though, or needs to differ between EOPNOTSUPP and
other errors...

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: prepare receiving side for REQ_DISCARD
Lars Ellenberg [Mon, 28 Apr 2014 16:43:23 +0000 (18:43 +0200)]
drbd: prepare receiving side for REQ_DISCARD

If the receiver needs to serve a discard request on a queue that does
not announce itself as discard capable, it falls back to a synchronous
blkdev_issue_zeroout().

We expect only "reasonably" large (up to one activity log extent?)
discard requests.

We do this to not block the receiver for too long in this
fallback code path, and to not set/clear too many bits inside one
spinlock_irq_save() in drbd_set_in_sync/drbd_set_out_of_sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: allow parallel promote/demote actions
Lars Ellenberg [Mon, 28 Apr 2014 16:43:22 +0000 (18:43 +0200)]
drbd: allow parallel promote/demote actions

We plan to use genl_family->parallel_ops = true in the future,
but need to review all possible interactions first.

For now, only selectively drop genl_lock() in drbd_set_role(),
instead serializing on our own internal resource->conf_update mutex.

We now can be promoted/demoted on many resources in parallel,
which may significantly improve cluster failover times
when fencing is required.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: perpare for genetlink parallel_ops
Lars Ellenberg [Mon, 28 Apr 2014 16:43:21 +0000 (18:43 +0200)]
drbd: perpare for genetlink parallel_ops

Because all administrative requests via genetlink have been globally
serialized via genl_lock(), we used to have one static struct
drbd_config_context "admin context".

Move this on-stack to the respective callback functions.

This will allow us to selectively drop the genl_lock()
(or use genl_family->parallel_ops) in the future.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: Do not BUG() when connection breaks in a special way
Philipp Reisner [Mon, 28 Apr 2014 16:43:20 +0000 (18:43 +0200)]
drbd: Do not BUG() when connection breaks in a special way

If, during a 'cluster wide' disconnect, the result comes back
from the peer and the connection breaks immediately after that,
_conn_rq_cond() reports back SS_CW_SUCCESS.
Therefore _conn_request_state() calls conn_set_state(), which
has a BUG() in it.
The BUG() is hit because conn_is_valid_transition() does not like
the transition, which goes back to is_valid_soft_transition()
returning SS_OUTDATE_WO_CONN.

The fix is to consider an error reported by is_valid_soft_transition()
even when the peer agreed to the transition.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: don't let application IO pre-empt resync too often
Lars Ellenberg [Mon, 28 Apr 2014 16:43:19 +0000 (18:43 +0200)]
drbd: don't let application IO pre-empt resync too often

Before, application IO could pre-empt resync activity
for up to a hardcoded 20 seconds per resync request.
A very busy server could throttle the effective resync bandwidth
down to one request per 20 seconds.

Now, we only let application IO pre-empt resync traffic
while the current resync rate estimate is above c-min-rate.

If you disable the c-min-rate throttle feature (set c-min-rate = 0),
application IO will no longer pre-empt resync traffic at all.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: fix potential distributed deadlock during verify or resync
Lars Ellenberg [Mon, 28 Apr 2014 16:43:18 +0000 (18:43 +0200)]
drbd: fix potential distributed deadlock during verify or resync

If max-buffers and socket buffer sizes are "too small" for the chosen
resync rate, this could potentially lead to a distributed deadlock,
which may or may not resolve itself via the "ko-count" and request
timeout mechanism, or could be resolved by forced disconnect.

One option to deal with this is proper configuration:
use larger max-buffers and socket buffer settings,
or reduce the resync rate.

But even with bad configuration we should not deadlock,
but "gracefully" recover.

The issue is avoided by using only up to max-buffers/2 for resync
requests, and by using max-buffers not as a hard limit for data buffer
allocations, but as a throttle threshold only.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: resync: fix too large bursts for very slow rates
Lars Ellenberg [Mon, 28 Apr 2014 16:43:17 +0000 (18:43 +0200)]
drbd: resync: fix too large bursts for very slow rates

While merging adjacent dirty blocks into resync requests,
the resync rate throttle was disregarded.
For very low resync rates, the effective rate may have exceeded
the intended rate by a larger margin.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
10 years agodrbd: fix stalled resync detection in /proc/drbd
Lars Ellenberg [Mon, 28 Apr 2014 16:43:16 +0000 (18:43 +0200)]
drbd: fix stalled resync detection in /proc/drbd

If we don't make resync or verify progress for "too long",
we want to flag it as "stalled".

Since 2010, "use rolling marks for resync speed calculation"
this "too long" was wrong by a factor of HZ.
With HZ 250, it would have been flagged as stalled
after 100 minutes.

Hardcode 3 minutes instead.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>