CPUs, here are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
-CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):
- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM CortexTM-A53 Octa-core: 80 KIOPS
-If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
+If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). For BFQ, this leads to the following maximum
As for cgroups-v1 (blkio controller), the exact set of stat files
created, and kept up-to-date by bfq, depends on whether
-CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
+CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/cgroup-v1/blkio-controller.rst. If, instead,
-CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
+CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files
blkio.bfq.io_service_bytes
blkio.bfq.io_service_bytes_recursive
blkio.bfq.io_serviced
blkio.bfq.io_serviced_recursive
-The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
+The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
-CONFIG_DEBUG_BLK_CGROUP.
+CONFIG_BFQ_CGROUP_DEBUG.
Parameters to set
-----------------
struct bvec_iter bi_iter; /* current index into bio_vec array */
unsigned int bi_size; /* total size in bytes */
- unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
unsigned short bi_hw_segments; /* segments after DMA remapping */
unsigned int bi_max; /* max bio_vecs we can hold
used as index into pool */
This file allows to turn off the disk entropy contribution. Default
value of this file is '1'(on).
+chunk_sectors (RO)
+------------------
+This has different meaning depending on the type of the block device.
+For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
+of the RAID volume stripe segment. For a zoned block device, either host-aware
+or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
+of the device, with the eventual exception of the last zone of the device which
+may be smaller.
+
dax (RO)
--------
This file indicates whether the device supports Direct Access (DAX),
smaller discards and potentially help reduce latencies induced by large
discard operations.
+discard_zeroes_data (RO)
+------------------------
+Obsolete. Always zero.
+
+fua (RO)
+--------
+Whether or not the block driver supports the FUA flag for write requests.
+FUA stands for Force Unit Access. If the FUA flag is set that means that
+write requests must bypass the volatile cache of the storage device.
+
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
-----------------------
This is the logical block size of the device, in bytes.
+max_discard_segments (RO)
+-------------------------
+The maximum number of DMA scatter/gather entries in a discard request.
+
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_integrity_segments (RO)
---------------------------
-When read, this file shows the max limit of integrity segments as
-set by block layer which a hardware controller can handle.
+Maximum number of elements in a DMA scatter/gather list with integrity
+data that will be submitted by the block layer core to the associated
+block driver.
max_sectors_kb (RW)
-------------------
max_segments (RO)
-----------------
-Maximum number of segments of the device.
+Maximum number of elements in a DMA scatter/gather list that is submitted
+to the associated block driver.
max_segment_size (RO)
---------------------
-Maximum segment size of the device.
+Maximum size in bytes of a single element in a DMA scatter/gather list.
minimum_io_size (RO)
--------------------
each request queue may have up to N request pools, each independently
regulated by nr_requests.
+nr_zones (RO)
+-------------
+For zoned block devices (zoned attribute indicating "host-managed" or
+"host-aware"), this indicates the total number of zones of the device.
+This is always 0 for regular block devices.
+
optimal_io_size (RO)
--------------------
This is the optimal IO size reported by the device.
command. A value of '0' means write-same is not supported by this
device.
-wb_lat_usec (RW)
-----------------
+wbt_lat_usec (RW)
+-----------------
If the device is registered for writeback throttling, then this file shows
the target minimum read latency. If this latency is exceeded in a given
window of time (see wb_window_usec), then the writeback throttling will start
have more smooth throughput, but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
+write_zeroes_max_bytes (RO)
+---------------------------
+For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
+bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
+is not supported.
+
zoned (RO)
----------
This indicates if the device is a zoned block device and the zone model of the
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".
-nr_zones (RO)
--------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), this indicates the total number of zones of the device.
-This is always 0 for regular block devices.
-
-chunk_sectors (RO)
-------------------
-This has different meaning depending on the type of the block device.
-For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
-of the RAID volume stripe segment. For a zoned block device, either host-aware
-or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
-of the device, with the eventual exception of the last zone of the device which
-may be smaller.
-
Jens Axboe <jens.axboe@oracle.com>, February 2009
CONFIG_BLK_CGROUP
- Block IO controller.
-CONFIG_DEBUG_BLK_CGROUP
+CONFIG_BFQ_CGROUP_DEBUG
- Debug help. Right now some additional stats file show up in cgroup
if this option is enabled.
write, sync or async.
- blkio.avg_queue_size
- - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
The average queue size for this cgroup over the entire time of this
cgroup's existence. Queue size samples are taken each time one of the
queues of this cgroup gets a timeslice.
- blkio.group_wait_time
- - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time the cgroup had to wait since it became busy
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
its queues. This is different from the io_wait_time which is the
got a timeslice and will not include the current delta.
- blkio.empty_time
- - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup. This is in
time it had a pending request and will not include the current delta.
- blkio.idle_time
- - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time spent by the IO scheduler idling for a
given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups. This is in nanoseconds. If this is read
the current delta.
- blkio.dequeue
- - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
gives the statistics about how many a times a group was dequeued
from service tree of the device. First two fields specify the major
and minor number of the device and third field specifies the number
cpu_startup_entry+0x6f/0x80
start_secondary+0x187/0x1e0
secondary_startup_64+0xa5/0xb0
+
+Example 3: Inject an error into the 10th admin command
+------------------------------------------------------
+
+echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
+echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
+echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
+nvme reset /dev/nvme0
+
+Expected Result:
+
+After NVMe controller reset, the reinitialization may or may not succeed.
+It depends on which admin command is actually forced to fail.
+
+Message from dmesg:
+
+nvme nvme0: resetting controller
+FAULT_INJECTION: forcing a failure.
+name fault_inject, interval 1, probability 100, space 1, times 1
+CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #2
+Hardware name: MSI MS-7A45/B150M MORTAR ARCTIC (MS-7A45), BIOS 1.50 04/25/2017
+Call Trace:
+ <IRQ>
+ dump_stack+0x63/0x85
+ should_fail+0x14a/0x170
+ nvme_should_fail+0x38/0x80 [nvme_core]
+ nvme_irq+0x129/0x280 [nvme]
+ ? blk_mq_end_request+0xb3/0x120
+ __handle_irq_event_percpu+0x84/0x1a0
+ handle_irq_event_percpu+0x32/0x80
+ handle_irq_event+0x3b/0x60
+ handle_edge_irq+0x7f/0x1a0
+ handle_irq+0x20/0x30
+ do_IRQ+0x4e/0xe0
+ common_interrupt+0xf/0xf
+ </IRQ>
+RIP: 0010:cpuidle_enter_state+0xc5/0x460
+Code: ff e8 8f 5f 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 69 03 00 00 31 ff e8 62 aa 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 37 03 00 00 4c 8b 45 d0 4c 2b 45 b8 48 ba cf f7 53
+RSP: 0018:ffffffff88c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
+RAX: ffff9dac25a2ac80 RBX: ffffffff88d53760 RCX: 000000000000001f
+RDX: 0000000000000000 RSI: 000000002d958403 RDI: 0000000000000000
+RBP: ffffffff88c03e18 R08: fffffff75e35ffb7 R09: 00000a49a56c0b48
+R10: ffffffff88c03da0 R11: 0000000000001b0c R12: ffff9dac25a34d00
+R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff88d53760
+ cpuidle_enter+0x2e/0x40
+ call_cpuidle+0x23/0x40
+ do_idle+0x201/0x280
+ cpu_startup_entry+0x1d/0x20
+ rest_init+0xaa/0xb0
+ arch_call_rest_init+0xe/0x1b
+ start_kernel+0x51c/0x53b
+ x86_64_start_reservations+0x24/0x26
+ x86_64_start_kernel+0x74/0x77
+ secondary_startup_64+0xa4/0xb0
+nvme nvme0: Could not set queue count (16385)
+nvme nvme0: IO queues not created
Enable hierarchical scheduling in BFQ, using the blkio
(cgroups-v1) or io (cgroups-v2) controller.
+config BFQ_CGROUP_DEBUG
+ bool "BFQ IO controller debugging"
+ depends on BFQ_GROUP_IOSCHED
+ ---help---
+ Enable some debugging help. Currently it exports additional stat
+ files in a cgroup which can be useful for debugging.
+
endmenu
endif
#include "bfq-iosched.h"
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
+static int bfq_stat_init(struct bfq_stat *stat, gfp_t gfp)
+{
+ int ret;
+
+ ret = percpu_counter_init(&stat->cpu_cnt, 0, gfp);
+ if (ret)
+ return ret;
+
+ atomic64_set(&stat->aux_cnt, 0);
+ return 0;
+}
+
+static void bfq_stat_exit(struct bfq_stat *stat)
+{
+ percpu_counter_destroy(&stat->cpu_cnt);
+}
+
+/**
+ * bfq_stat_add - add a value to a bfq_stat
+ * @stat: target bfq_stat
+ * @val: value to add
+ *
+ * Add @val to @stat. The caller must ensure that IRQ on the same CPU
+ * don't re-enter this function for the same counter.
+ */
+static inline void bfq_stat_add(struct bfq_stat *stat, uint64_t val)
+{
+ percpu_counter_add_batch(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
+}
+
+/**
+ * bfq_stat_read - read the current value of a bfq_stat
+ * @stat: bfq_stat to read
+ */
+static inline uint64_t bfq_stat_read(struct bfq_stat *stat)
+{
+ return percpu_counter_sum_positive(&stat->cpu_cnt);
+}
+
+/**
+ * bfq_stat_reset - reset a bfq_stat
+ * @stat: bfq_stat to reset
+ */
+static inline void bfq_stat_reset(struct bfq_stat *stat)
+{
+ percpu_counter_set(&stat->cpu_cnt, 0);
+ atomic64_set(&stat->aux_cnt, 0);
+}
+
+/**
+ * bfq_stat_add_aux - add a bfq_stat into another's aux count
+ * @to: the destination bfq_stat
+ * @from: the source
+ *
+ * Add @from's count including the aux one to @to's aux count.
+ */
+static inline void bfq_stat_add_aux(struct bfq_stat *to,
+ struct bfq_stat *from)
+{
+ atomic64_add(bfq_stat_read(from) + atomic64_read(&from->aux_cnt),
+ &to->aux_cnt);
+}
+
+/**
+ * blkg_prfill_stat - prfill callback for bfq_stat
+ * @sf: seq_file to print to
+ * @pd: policy private data of interest
+ * @off: offset to the bfq_stat in @pd
+ *
+ * prfill callback for printing a bfq_stat.
+ */
+static u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd,
+ int off)
+{
+ return __blkg_prfill_u64(sf, pd, bfq_stat_read((void *)pd + off));
+}
/* bfqg stats flags */
enum bfqg_stats_flags {
now = ktime_get_ns();
if (now > stats->start_group_wait_time)
- blkg_stat_add(&stats->group_wait_time,
+ bfq_stat_add(&stats->group_wait_time,
now - stats->start_group_wait_time);
bfqg_stats_clear_waiting(stats);
}
now = ktime_get_ns();
if (now > stats->start_empty_time)
- blkg_stat_add(&stats->empty_time,
+ bfq_stat_add(&stats->empty_time,
now - stats->start_empty_time);
bfqg_stats_clear_empty(stats);
}
void bfqg_stats_update_dequeue(struct bfq_group *bfqg)
{
- blkg_stat_add(&bfqg->stats.dequeue, 1);
+ bfq_stat_add(&bfqg->stats.dequeue, 1);
}
void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
u64 now = ktime_get_ns();
if (now > stats->start_idle_time)
- blkg_stat_add(&stats->idle_time,
+ bfq_stat_add(&stats->idle_time,
now - stats->start_idle_time);
bfqg_stats_clear_idling(stats);
}
{
struct bfqg_stats *stats = &bfqg->stats;
- blkg_stat_add(&stats->avg_queue_size_sum,
+ bfq_stat_add(&stats->avg_queue_size_sum,
blkg_rwstat_total(&stats->queued));
- blkg_stat_add(&stats->avg_queue_size_samples, 1);
+ bfq_stat_add(&stats->avg_queue_size_samples, 1);
bfqg_stats_update_group_wait_time(stats);
}
io_start_time_ns - start_time_ns);
}
-#else /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+#else /* CONFIG_BFQ_CGROUP_DEBUG */
void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
unsigned int op) { }
void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }
-#endif /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
#ifdef CONFIG_BFQ_GROUP_IOSCHED
/* @stats = 0 */
static void bfqg_stats_reset(struct bfqg_stats *stats)
{
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
/* queued stats shouldn't be cleared */
blkg_rwstat_reset(&stats->merged);
blkg_rwstat_reset(&stats->service_time);
blkg_rwstat_reset(&stats->wait_time);
- blkg_stat_reset(&stats->time);
- blkg_stat_reset(&stats->avg_queue_size_sum);
- blkg_stat_reset(&stats->avg_queue_size_samples);
- blkg_stat_reset(&stats->dequeue);
- blkg_stat_reset(&stats->group_wait_time);
- blkg_stat_reset(&stats->idle_time);
- blkg_stat_reset(&stats->empty_time);
+ bfq_stat_reset(&stats->time);
+ bfq_stat_reset(&stats->avg_queue_size_sum);
+ bfq_stat_reset(&stats->avg_queue_size_samples);
+ bfq_stat_reset(&stats->dequeue);
+ bfq_stat_reset(&stats->group_wait_time);
+ bfq_stat_reset(&stats->idle_time);
+ bfq_stat_reset(&stats->empty_time);
#endif
}
if (!to || !from)
return;
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
/* queued stats shouldn't be cleared */
blkg_rwstat_add_aux(&to->merged, &from->merged);
blkg_rwstat_add_aux(&to->service_time, &from->service_time);
blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
- blkg_stat_add_aux(&from->time, &from->time);
- blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
- blkg_stat_add_aux(&to->avg_queue_size_samples,
+ bfq_stat_add_aux(&from->time, &from->time);
+ bfq_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+ bfq_stat_add_aux(&to->avg_queue_size_samples,
&from->avg_queue_size_samples);
- blkg_stat_add_aux(&to->dequeue, &from->dequeue);
- blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
- blkg_stat_add_aux(&to->idle_time, &from->idle_time);
- blkg_stat_add_aux(&to->empty_time, &from->empty_time);
+ bfq_stat_add_aux(&to->dequeue, &from->dequeue);
+ bfq_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
+ bfq_stat_add_aux(&to->idle_time, &from->idle_time);
+ bfq_stat_add_aux(&to->empty_time, &from->empty_time);
#endif
}
static void bfqg_stats_exit(struct bfqg_stats *stats)
{
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
blkg_rwstat_exit(&stats->merged);
blkg_rwstat_exit(&stats->service_time);
blkg_rwstat_exit(&stats->wait_time);
blkg_rwstat_exit(&stats->queued);
- blkg_stat_exit(&stats->time);
- blkg_stat_exit(&stats->avg_queue_size_sum);
- blkg_stat_exit(&stats->avg_queue_size_samples);
- blkg_stat_exit(&stats->dequeue);
- blkg_stat_exit(&stats->group_wait_time);
- blkg_stat_exit(&stats->idle_time);
- blkg_stat_exit(&stats->empty_time);
+ bfq_stat_exit(&stats->time);
+ bfq_stat_exit(&stats->avg_queue_size_sum);
+ bfq_stat_exit(&stats->avg_queue_size_samples);
+ bfq_stat_exit(&stats->dequeue);
+ bfq_stat_exit(&stats->group_wait_time);
+ bfq_stat_exit(&stats->idle_time);
+ bfq_stat_exit(&stats->empty_time);
#endif
}
static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
{
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
if (blkg_rwstat_init(&stats->merged, gfp) ||
blkg_rwstat_init(&stats->service_time, gfp) ||
blkg_rwstat_init(&stats->wait_time, gfp) ||
blkg_rwstat_init(&stats->queued, gfp) ||
- blkg_stat_init(&stats->time, gfp) ||
- blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
- blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
- blkg_stat_init(&stats->dequeue, gfp) ||
- blkg_stat_init(&stats->group_wait_time, gfp) ||
- blkg_stat_init(&stats->idle_time, gfp) ||
- blkg_stat_init(&stats->empty_time, gfp)) {
+ bfq_stat_init(&stats->time, gfp) ||
+ bfq_stat_init(&stats->avg_queue_size_sum, gfp) ||
+ bfq_stat_init(&stats->avg_queue_size_samples, gfp) ||
+ bfq_stat_init(&stats->dequeue, gfp) ||
+ bfq_stat_init(&stats->group_wait_time, gfp) ||
+ bfq_stat_init(&stats->idle_time, gfp) ||
+ bfq_stat_init(&stats->empty_time, gfp)) {
bfqg_stats_exit(stats);
return -ENOMEM;
}
return ret ?: nbytes;
}
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
static int bfqg_print_stat(struct seq_file *sf, void *v)
{
blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat,
static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
- u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
- &blkcg_policy_bfq, off);
+ struct blkcg_gq *blkg = pd_to_blkg(pd);
+ struct blkcg_gq *pos_blkg;
+ struct cgroup_subsys_state *pos_css;
+ u64 sum = 0;
+
+ lockdep_assert_held(&blkg->q->queue_lock);
+
+ rcu_read_lock();
+ blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
+ struct bfq_stat *stat;
+
+ if (!pos_blkg->online)
+ continue;
+
+ stat = (void *)blkg_to_pd(pos_blkg, &blkcg_policy_bfq) + off;
+ sum += bfq_stat_read(stat) + atomic64_read(&stat->aux_cnt);
+ }
+ rcu_read_unlock();
+
return __blkg_prfill_u64(sf, pd, sum);
}
static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
- struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
- &blkcg_policy_bfq,
- off);
+ struct blkg_rwstat_sample sum;
+
+ blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off, &sum);
return __blkg_prfill_rwstat(sf, pd, &sum);
}
static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
- struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
- offsetof(struct blkcg_gq, stat_bytes));
- u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
- atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
+ struct blkg_rwstat_sample tmp;
- return __blkg_prfill_u64(sf, pd, sum >> 9);
+ blkg_rwstat_recursive_sum(pd->blkg, NULL,
+ offsetof(struct blkcg_gq, stat_bytes), &tmp);
+
+ return __blkg_prfill_u64(sf, pd,
+ (tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE]) >> 9);
}
static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
struct blkg_policy_data *pd, int off)
{
struct bfq_group *bfqg = pd_to_bfqg(pd);
- u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples);
+ u64 samples = bfq_stat_read(&bfqg->stats.avg_queue_size_samples);
u64 v = 0;
if (samples) {
- v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum);
+ v = bfq_stat_read(&bfqg->stats.avg_queue_size_sum);
v = div64_u64(v, samples);
}
__blkg_prfill_u64(sf, pd, v);
0, false);
return 0;
}
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
{
.private = (unsigned long)&blkcg_policy_bfq,
.seq_show = blkg_print_stat_ios,
},
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
{
.name = "bfq.time",
.private = offsetof(struct bfq_group, stats.time),
.private = offsetof(struct bfq_group, stats.queued),
.seq_show = bfqg_print_rwstat,
},
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
/* the same statistics which cover the bfqg and its descendants */
{
.private = (unsigned long)&blkcg_policy_bfq,
.seq_show = blkg_print_stat_ios_recursive,
},
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
{
.name = "bfq.time_recursive",
.private = offsetof(struct bfq_group, stats.time),
.private = offsetof(struct bfq_group, stats.dequeue),
.seq_show = bfqg_print_stat,
},
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
{ } /* terminate */
};
BFQ_BFQQ_FNS(coop);
BFQ_BFQQ_FNS(split_coop);
BFQ_BFQQ_FNS(softrt_update);
+BFQ_BFQQ_FNS(has_waker);
#undef BFQ_BFQQ_FNS \
/* Expiration time of sync (0) and async (1) requests, in ns. */
* mechanism may be re-designed in such a way to make it possible to
* know whether preemption is needed without needing to update service
* trees). In addition, queue preemptions almost always cause random
- * I/O, and thus loss of throughput. Because of these facts, the next
- * function adopts the following simple scheme to avoid both costly
- * operations and too frequent preemptions: it requests the expiration
- * of the in-service queue (unconditionally) only for queues that need
- * to recover a hole, or that either are weight-raised or deserve to
- * be weight-raised.
+ * I/O, which may in turn cause loss of throughput. Finally, there may
+ * even be no in-service queue when the next function is invoked (so,
+ * no queue to compare timestamps with). Because of these facts, the
+ * next function adopts the following simple scheme to avoid costly
+ * operations, too frequent preemptions and too many dependencies on
+ * the state of the scheduler: it requests the expiration of the
+ * in-service queue (unconditionally) only for queues that need to
+ * recover a hole. Then it delegates to other parts of the code the
+ * responsibility of handling the above case 2.
*/
static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
- bool arrived_in_time,
- bool wr_or_deserves_wr)
+ bool arrived_in_time)
{
struct bfq_entity *entity = &bfqq->entity;
entity->budget = max_t(unsigned long, bfqq->max_budget,
bfq_serv_to_charge(bfqq->next_rq, bfqq));
bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
- return wr_or_deserves_wr;
+ return false;
}
/*
bfqd->bfq_wr_min_idle_time);
}
+
+/*
+ * Return true if bfqq is in a higher priority class, or has a higher
+ * weight than the in-service queue.
+ */
+static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
+ struct bfq_queue *in_serv_bfqq)
+{
+ int bfqq_weight, in_serv_weight;
+
+ if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
+ return true;
+
+ if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
+ bfqq_weight = bfqq->entity.weight;
+ in_serv_weight = in_serv_bfqq->entity.weight;
+ } else {
+ if (bfqq->entity.parent)
+ bfqq_weight = bfqq->entity.parent->weight;
+ else
+ bfqq_weight = bfqq->entity.weight;
+ if (in_serv_bfqq->entity.parent)
+ in_serv_weight = in_serv_bfqq->entity.parent->weight;
+ else
+ in_serv_weight = in_serv_bfqq->entity.weight;
+ }
+
+ return bfqq_weight > in_serv_weight;
+}
+
static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
int old_wr_coeff,
*/
bfqq_wants_to_preempt =
bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
- arrived_in_time,
- wr_or_deserves_wr);
+ arrived_in_time);
/*
* If bfqq happened to be activated in a burst, but has been
/*
* Expire in-service queue only if preemption may be needed
- * for guarantees. In this respect, the function
- * next_queue_may_preempt just checks a simple, necessary
- * condition, and not a sufficient condition based on
- * timestamps. In fact, for the latter condition to be
- * evaluated, timestamps would need first to be updated, and
- * this operation is quite costly (see the comments on the
- * function bfq_bfqq_update_budg_for_activation).
+ * for guarantees. In particular, we care only about two
+ * cases. The first is that bfqq has to recover a service
+ * hole, as explained in the comments on
+ * bfq_bfqq_update_budg_for_activation(), i.e., that
+ * bfqq_wants_to_preempt is true. However, if bfqq does not
+ * carry time-critical I/O, then bfqq's bandwidth is less
+ * important than that of queues that carry time-critical I/O.
+ * So, as a further constraint, we consider this case only if
+ * bfqq is at least as weight-raised, i.e., at least as time
+ * critical, as the in-service queue.
+ *
+ * The second case is that bfqq is in a higher priority class,
+ * or has a higher weight than the in-service queue. If this
+ * condition does not hold, we don't care because, even if
+ * bfqq does not start to be served immediately, the resulting
+ * delay for bfqq's I/O is however lower or much lower than
+ * the ideal completion time to be guaranteed to bfqq's I/O.
+ *
+ * In both cases, preemption is needed only if, according to
+ * the timestamps of both bfqq and of the in-service queue,
+ * bfqq actually is the next queue to serve. So, to reduce
+ * useless preemptions, the return value of
+ * next_queue_may_preempt() is considered in the next compound
+ * condition too. Yet next_queue_may_preempt() just checks a
+ * simple, necessary condition for bfqq to be the next queue
+ * to serve. In fact, to evaluate a sufficient condition, the
+ * timestamps of the in-service queue would need to be
+ * updated, and this operation is quite costly (see the
+ * comments on bfq_bfqq_update_budg_for_activation()).
*/
- if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
- bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff &&
+ if (bfqd->in_service_queue &&
+ ((bfqq_wants_to_preempt &&
+ bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
+ bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue)) &&
next_queue_may_preempt(bfqd))
bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
false, BFQQE_PREEMPTED);
}
+static void bfq_reset_inject_limit(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ /* invalidate baseline total service time */
+ bfqq->last_serv_time_ns = 0;
+
+ /*
+ * Reset pointer in case we are waiting for
+ * some request completion.
+ */
+ bfqd->waited_rq = NULL;
+
+ /*
+ * If bfqq has a short think time, then start by setting the
+ * inject limit to 0 prudentially, because the service time of
+ * an injected I/O request may be higher than the think time
+ * of bfqq, and therefore, if one request was injected when
+ * bfqq remains empty, this injected request might delay the
+ * service of the next I/O request for bfqq significantly. In
+ * case bfqq can actually tolerate some injection, then the
+ * adaptive update will however raise the limit soon. This
+ * lucky circumstance holds exactly because bfqq has a short
+ * think time, and thus, after remaining empty, is likely to
+ * get new I/O enqueued---and then completed---before being
+ * expired. This is the very pattern that gives the
+ * limit-update algorithm the chance to measure the effect of
+ * injection on request service times, and then to update the
+ * limit accordingly.
+ *
+ * However, in the following special case, the inject limit is
+ * left to 1 even if the think time is short: bfqq's I/O is
+ * synchronized with that of some other queue, i.e., bfqq may
+ * receive new I/O only after the I/O of the other queue is
+ * completed. Keeping the inject limit to 1 allows the
+ * blocking I/O to be served while bfqq is in service. And
+ * this is very convenient both for bfqq and for overall
+ * throughput, as explained in detail in the comments in
+ * bfq_update_has_short_ttime().
+ *
+ * On the opposite end, if bfqq has a long think time, then
+ * start directly by 1, because:
+ * a) on the bright side, keeping at most one request in
+ * service in the drive is unlikely to cause any harm to the
+ * latency of bfqq's requests, as the service time of a single
+ * request is likely to be lower than the think time of bfqq;
+ * b) on the downside, after becoming empty, bfqq is likely to
+ * expire before getting its next request. With this request
+ * arrival pattern, it is very hard to sample total service
+ * times and update the inject limit accordingly (see comments
+ * on bfq_update_inject_limit()). So the limit is likely to be
+ * never, or at least seldom, updated. As a consequence, by
+ * setting the limit to 1, we avoid that no injection ever
+ * occurs with bfqq. On the downside, this proactive step
+ * further reduces chances to actually compute the baseline
+ * total service time. Thus it reduces chances to execute the
+ * limit-update algorithm and possibly raise the limit to more
+ * than 1.
+ */
+ if (bfq_bfqq_has_short_ttime(bfqq))
+ bfqq->inject_limit = 0;
+ else
+ bfqq->inject_limit = 1;
+
+ bfqq->decrease_time_jif = jiffies;
+}
+
static void bfq_add_request(struct request *rq)
{
struct bfq_queue *bfqq = RQ_BFQQ(rq);
if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
/*
+ * Detect whether bfqq's I/O seems synchronized with
+ * that of some other queue, i.e., whether bfqq, after
+ * remaining empty, happens to receive new I/O only
+ * right after some I/O request of the other queue has
+ * been completed. We call waker queue the other
+ * queue, and we assume, for simplicity, that bfqq may
+ * have at most one waker queue.
+ *
+ * A remarkable throughput boost can be reached by
+ * unconditionally injecting the I/O of the waker
+ * queue, every time a new bfq_dispatch_request
+ * happens to be invoked while I/O is being plugged
+ * for bfqq. In addition to boosting throughput, this
+ * unblocks bfqq's I/O, thereby improving bandwidth
+ * and latency for bfqq. Note that these same results
+ * may be achieved with the general injection
+ * mechanism, but less effectively. For details on
+ * this aspect, see the comments on the choice of the
+ * queue for injection in bfq_select_queue().
+ *
+ * Turning back to the detection of a waker queue, a
+ * queue Q is deemed as a waker queue for bfqq if, for
+ * two consecutive times, bfqq happens to become non
+ * empty right after a request of Q has been
+ * completed. In particular, on the first time, Q is
+ * tentatively set as a candidate waker queue, while
+ * on the second time, the flag
+ * bfq_bfqq_has_waker(bfqq) is set to confirm that Q
+ * is a waker queue for bfqq. These detection steps
+ * are performed only if bfqq has a long think time,
+ * so as to make it more likely that bfqq's I/O is
+ * actually being blocked by a synchronization. This
+ * last filter, plus the above two-times requirement,
+ * make false positives less likely.
+ *
+ * NOTE
+ *
+ * The sooner a waker queue is detected, the sooner
+ * throughput can be boosted by injecting I/O from the
+ * waker queue. Fortunately, detection is likely to be
+ * actually fast, for the following reasons. While
+ * blocked by synchronization, bfqq has a long think
+ * time. This implies that bfqq's inject limit is at
+ * least equal to 1 (see the comments in
+ * bfq_update_inject_limit()). So, thanks to
+ * injection, the waker queue is likely to be served
+ * during the very first I/O-plugging time interval
+ * for bfqq. This triggers the first step of the
+ * detection mechanism. Thanks again to injection, the
+ * candidate waker queue is then likely to be
+ * confirmed no later than during the next
+ * I/O-plugging interval for bfqq.
+ */
+ if (!bfq_bfqq_has_short_ttime(bfqq) &&
+ ktime_get_ns() - bfqd->last_completion <
+ 200 * NSEC_PER_USEC) {
+ if (bfqd->last_completed_rq_bfqq != bfqq &&
+ bfqd->last_completed_rq_bfqq !=
+ bfqq->waker_bfqq) {
+ /*
+ * First synchronization detected with
+ * a candidate waker queue, or with a
+ * different candidate waker queue
+ * from the current one.
+ */
+ bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
+
+ /*
+ * If the waker queue disappears, then
+ * bfqq->waker_bfqq must be reset. To
+ * this goal, we maintain in each
+ * waker queue a list, woken_list, of
+ * all the queues that reference the
+ * waker queue through their
+ * waker_bfqq pointer. When the waker
+ * queue exits, the waker_bfqq pointer
+ * of all the queues in the woken_list
+ * is reset.
+ *
+ * In addition, if bfqq is already in
+ * the woken_list of a waker queue,
+ * then, before being inserted into
+ * the woken_list of a new waker
+ * queue, bfqq must be removed from
+ * the woken_list of the old waker
+ * queue.
+ */
+ if (!hlist_unhashed(&bfqq->woken_list_node))
+ hlist_del_init(&bfqq->woken_list_node);
+ hlist_add_head(&bfqq->woken_list_node,
+ &bfqd->last_completed_rq_bfqq->woken_list);
+
+ bfq_clear_bfqq_has_waker(bfqq);
+ } else if (bfqd->last_completed_rq_bfqq ==
+ bfqq->waker_bfqq &&
+ !bfq_bfqq_has_waker(bfqq)) {
+ /*
+ * synchronization with waker_bfqq
+ * seen for the second time
+ */
+ bfq_mark_bfqq_has_waker(bfqq);
+ }
+ }
+
+ /*
* Periodically reset inject limit, to make sure that
* the latter eventually drops in case workload
* changes, see step (3) in the comments on
* bfq_update_inject_limit().
*/
if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
- msecs_to_jiffies(1000))) {
- /* invalidate baseline total service time */
- bfqq->last_serv_time_ns = 0;
-
- /*
- * Reset pointer in case we are waiting for
- * some request completion.
- */
- bfqd->waited_rq = NULL;
-
- /*
- * If bfqq has a short think time, then start
- * by setting the inject limit to 0
- * prudentially, because the service time of
- * an injected I/O request may be higher than
- * the think time of bfqq, and therefore, if
- * one request was injected when bfqq remains
- * empty, this injected request might delay
- * the service of the next I/O request for
- * bfqq significantly. In case bfqq can
- * actually tolerate some injection, then the
- * adaptive update will however raise the
- * limit soon. This lucky circumstance holds
- * exactly because bfqq has a short think
- * time, and thus, after remaining empty, is
- * likely to get new I/O enqueued---and then
- * completed---before being expired. This is
- * the very pattern that gives the
- * limit-update algorithm the chance to
- * measure the effect of injection on request
- * service times, and then to update the limit
- * accordingly.
- *
- * On the opposite end, if bfqq has a long
- * think time, then start directly by 1,
- * because:
- * a) on the bright side, keeping at most one
- * request in service in the drive is unlikely
- * to cause any harm to the latency of bfqq's
- * requests, as the service time of a single
- * request is likely to be lower than the
- * think time of bfqq;
- * b) on the downside, after becoming empty,
- * bfqq is likely to expire before getting its
- * next request. With this request arrival
- * pattern, it is very hard to sample total
- * service times and update the inject limit
- * accordingly (see comments on
- * bfq_update_inject_limit()). So the limit is
- * likely to be never, or at least seldom,
- * updated. As a consequence, by setting the
- * limit to 1, we avoid that no injection ever
- * occurs with bfqq. On the downside, this
- * proactive step further reduces chances to
- * actually compute the baseline total service
- * time. Thus it reduces chances to execute the
- * limit-update algorithm and possibly raise the
- * limit to more than 1.
- */
- if (bfq_bfqq_has_short_ttime(bfqq))
- bfqq->inject_limit = 0;
- else
- bfqq->inject_limit = 1;
- bfqq->decrease_time_jif = jiffies;
- }
+ msecs_to_jiffies(1000)))
+ bfq_reset_inject_limit(bfqd, bfqq);
/*
* The following conditions must hold to setup a new
}
-static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ unsigned int nr_segs)
{
struct request_queue *q = hctx->queue;
struct bfq_data *bfqd = q->elevator->elevator_data;
bfqd->bio_bfqq = NULL;
bfqd->bio_bic = bic;
- ret = blk_mq_sched_try_merge(q, bio, &free);
+ ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
if (free)
blk_mq_free_request(free);
* to enjoy weight raising if split soon.
*/
bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+ bic->saved_wr_start_at_switch_to_srt = bfq_smallest_from_now();
bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
bic->saved_last_wr_start_finish = jiffies;
} else {
bfq_remove_request(q, rq);
}
-static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * There is a case where idling does not have to be performed for
+ * throughput concerns, but to preserve the throughput share of
+ * the process associated with bfqq.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired throughput
+ * distribution only in a completely symmetric, or favorably
+ * skewed scenario where:
+ * (i-a) each of these processes must get the same throughput as
+ * the others,
+ * (i-b) in case (i-a) does not hold, it holds that the process
+ * associated with bfqq must receive a lower or equal
+ * throughput than any of the other processes;
+ * (ii) the I/O of each process has the same properties, in
+ * terms of locality (sequential or random), direction
+ * (reads or writes), request sizes, greediness
+ * (from I/O-bound to sporadic), and so on;
+
+ * In fact, in such a scenario, the drive tends to treat the requests
+ * of each process in about the same way as the requests of the
+ * others, and thus to provide each of these processes with about the
+ * same throughput. This is exactly the desired throughput
+ * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
+ * even more convenient distribution for (the process associated with)
+ * bfqq.
+ *
+ * In contrast, in any asymmetric or unfavorable scenario, device
+ * idling (I/O-dispatch plugging) is certainly needed to guarantee
+ * that bfqq receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * The problem is that idling may significantly reduce throughput with
+ * certain combinations of types of I/O and devices. An important
+ * example is sync random I/O on flash storage with command
+ * queueing. So, unless bfqq falls in cases where idling also boosts
+ * throughput, it is important to check conditions (i-a), i(-b) and
+ * (ii) accurately, so as to avoid idling when not strictly needed for
+ * service guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly check
+ * condition (ii). And, in case there are active groups, it becomes
+ * very difficult to check conditions (i-a) and (i-b) too. In fact,
+ * if there are active groups, then, for conditions (i-a) or (i-b) to
+ * become false 'indirectly', it is enough that an active group
+ * contains more active processes or sub-groups than some other active
+ * group. More precisely, for conditions (i-a) or (i-b) to become
+ * false because of such a group, it is not even necessary that the
+ * group is (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still have
+ * some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed their
+ * share of the throughput even after being dispatched. In this
+ * respect, it is easy to show that, if a group frequently becomes
+ * inactive while still having in-flight requests, and if, when this
+ * happens, the group is not considered in the calculation of whether
+ * the scenario is asymmetric, then the group may fail to be
+ * guaranteed its fair share of the throughput (basically because
+ * idling may not be performed for the descendant processes of the
+ * group, but it had to be). We address this issue with the following
+ * bi-modal behavior, implemented in the function
+ * bfq_asymmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests waiting
+ * for completion, then only conditions (i-a) and (i-b) are actually
+ * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
+ * idling is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if conditions (i-a) and (i-b) do not
+ * hold, then idling is allowed, and the device tends to be prevented
+ * from queueing many requests, possibly of several processes. Since
+ * there are no groups with requests waiting for completion, then, to
+ * control conditions (i-a) and (i-b) it is enough to check just
+ * whether all the queues with requests waiting for completion also
+ * have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_asymmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
+static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ return (bfqq->wr_coeff > 1 &&
+ bfqd->wr_busy_queues <
+ bfq_tot_busy_queues(bfqd)) ||
+ bfq_asymmetric_scenario(bfqd, bfqq);
+}
+
+static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ enum bfqq_expiration reason)
{
/*
* If this bfqq is shared between multiple processes, check
if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
bfq_mark_bfqq_split_coop(bfqq);
- if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+ /*
+ * Consider queues with a higher finish virtual time than
+ * bfqq. If idling_needed_for_service_guarantees(bfqq) returns
+ * true, then bfqq's bandwidth would be violated if an
+ * uncontrolled amount of I/O from these queues were
+ * dispatched while bfqq is waiting for its new I/O to
+ * arrive. This is exactly what may happen if this is a forced
+ * expiration caused by a preemption attempt, and if bfqq is
+ * not re-scheduled. To prevent this from happening, re-queue
+ * bfqq if it needs I/O-dispatch plugging, even if it is
+ * empty. By doing so, bfqq is granted to be served before the
+ * above queues (provided that bfqq is of course eligible).
+ */
+ if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ !(reason == BFQQE_PREEMPTED &&
+ idling_needed_for_service_guarantees(bfqd, bfqq))) {
if (bfqq->dispatched == 0)
/*
* Overloading budget_timeout field to store
* Resort priority tree of potential close cooperators.
* See comments on bfq_pos_tree_add_move() for the unlikely().
*/
- if (unlikely(!bfqd->nonrot_with_queueing))
+ if (unlikely(!bfqd->nonrot_with_queueing &&
+ !RB_EMPTY_ROOT(&bfqq->sort_list)))
bfq_pos_tree_add_move(bfqd, bfqq);
}
* reason.
*/
__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
- if (__bfq_bfqq_expire(bfqd, bfqq))
+ if (__bfq_bfqq_expire(bfqd, bfqq, reason))
/* bfqq is gone, no more actions on it */
return;
}
/*
- * There is a case where idling does not have to be performed for
- * throughput concerns, but to preserve the throughput share of
- * the process associated with bfqq.
- *
- * To introduce this case, we can note that allowing the drive
- * to enqueue more than one request at a time, and hence
- * delegating de facto final scheduling decisions to the
- * drive's internal scheduler, entails loss of control on the
- * actual request service order. In particular, the critical
- * situation is when requests from different processes happen
- * to be present, at the same time, in the internal queue(s)
- * of the drive. In such a situation, the drive, by deciding
- * the service order of the internally-queued requests, does
- * determine also the actual throughput distribution among
- * these processes. But the drive typically has no notion or
- * concern about per-process throughput distribution, and
- * makes its decisions only on a per-request basis. Therefore,
- * the service distribution enforced by the drive's internal
- * scheduler is likely to coincide with the desired throughput
- * distribution only in a completely symmetric, or favorably
- * skewed scenario where:
- * (i-a) each of these processes must get the same throughput as
- * the others,
- * (i-b) in case (i-a) does not hold, it holds that the process
- * associated with bfqq must receive a lower or equal
- * throughput than any of the other processes;
- * (ii) the I/O of each process has the same properties, in
- * terms of locality (sequential or random), direction
- * (reads or writes), request sizes, greediness
- * (from I/O-bound to sporadic), and so on;
-
- * In fact, in such a scenario, the drive tends to treat the requests
- * of each process in about the same way as the requests of the
- * others, and thus to provide each of these processes with about the
- * same throughput. This is exactly the desired throughput
- * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
- * even more convenient distribution for (the process associated with)
- * bfqq.
- *
- * In contrast, in any asymmetric or unfavorable scenario, device
- * idling (I/O-dispatch plugging) is certainly needed to guarantee
- * that bfqq receives its assigned fraction of the device throughput
- * (see [1] for details).
- *
- * The problem is that idling may significantly reduce throughput with
- * certain combinations of types of I/O and devices. An important
- * example is sync random I/O on flash storage with command
- * queueing. So, unless bfqq falls in cases where idling also boosts
- * throughput, it is important to check conditions (i-a), i(-b) and
- * (ii) accurately, so as to avoid idling when not strictly needed for
- * service guarantees.
- *
- * Unfortunately, it is extremely difficult to thoroughly check
- * condition (ii). And, in case there are active groups, it becomes
- * very difficult to check conditions (i-a) and (i-b) too. In fact,
- * if there are active groups, then, for conditions (i-a) or (i-b) to
- * become false 'indirectly', it is enough that an active group
- * contains more active processes or sub-groups than some other active
- * group. More precisely, for conditions (i-a) or (i-b) to become
- * false because of such a group, it is not even necessary that the
- * group is (still) active: it is sufficient that, even if the group
- * has become inactive, some of its descendant processes still have
- * some request already dispatched but still waiting for
- * completion. In fact, requests have still to be guaranteed their
- * share of the throughput even after being dispatched. In this
- * respect, it is easy to show that, if a group frequently becomes
- * inactive while still having in-flight requests, and if, when this
- * happens, the group is not considered in the calculation of whether
- * the scenario is asymmetric, then the group may fail to be
- * guaranteed its fair share of the throughput (basically because
- * idling may not be performed for the descendant processes of the
- * group, but it had to be). We address this issue with the following
- * bi-modal behavior, implemented in the function
- * bfq_asymmetric_scenario().
- *
- * If there are groups with requests waiting for completion
- * (as commented above, some of these groups may even be
- * already inactive), then the scenario is tagged as
- * asymmetric, conservatively, without checking any of the
- * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
- * This behavior matches also the fact that groups are created
- * exactly if controlling I/O is a primary concern (to
- * preserve bandwidth and latency guarantees).
- *
- * On the opposite end, if there are no groups with requests waiting
- * for completion, then only conditions (i-a) and (i-b) are actually
- * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
- * idling is not performed, regardless of whether condition (ii)
- * holds. In other words, only if conditions (i-a) and (i-b) do not
- * hold, then idling is allowed, and the device tends to be prevented
- * from queueing many requests, possibly of several processes. Since
- * there are no groups with requests waiting for completion, then, to
- * control conditions (i-a) and (i-b) it is enough to check just
- * whether all the queues with requests waiting for completion also
- * have the same weight.
- *
- * Not checking condition (ii) evidently exposes bfqq to the
- * risk of getting less throughput than its fair share.
- * However, for queues with the same weight, a further
- * mechanism, preemption, mitigates or even eliminates this
- * problem. And it does so without consequences on overall
- * throughput. This mechanism and its benefits are explained
- * in the next three paragraphs.
- *
- * Even if a queue, say Q, is expired when it remains idle, Q
- * can still preempt the new in-service queue if the next
- * request of Q arrives soon (see the comments on
- * bfq_bfqq_update_budg_for_activation). If all queues and
- * groups have the same weight, this form of preemption,
- * combined with the hole-recovery heuristic described in the
- * comments on function bfq_bfqq_update_budg_for_activation,
- * are enough to preserve a correct bandwidth distribution in
- * the mid term, even without idling. In fact, even if not
- * idling allows the internal queues of the device to contain
- * many requests, and thus to reorder requests, we can rather
- * safely assume that the internal scheduler still preserves a
- * minimum of mid-term fairness.
- *
- * More precisely, this preemption-based, idleless approach
- * provides fairness in terms of IOPS, and not sectors per
- * second. This can be seen with a simple example. Suppose
- * that there are two queues with the same weight, but that
- * the first queue receives requests of 8 sectors, while the
- * second queue receives requests of 1024 sectors. In
- * addition, suppose that each of the two queues contains at
- * most one request at a time, which implies that each queue
- * always remains idle after it is served. Finally, after
- * remaining idle, each queue receives very quickly a new
- * request. It follows that the two queues are served
- * alternatively, preempting each other if needed. This
- * implies that, although both queues have the same weight,
- * the queue with large requests receives a service that is
- * 1024/8 times as high as the service received by the other
- * queue.
- *
- * The motivation for using preemption instead of idling (for
- * queues with the same weight) is that, by not idling,
- * service guarantees are preserved (completely or at least in
- * part) without minimally sacrificing throughput. And, if
- * there is no active group, then the primary expectation for
- * this device is probably a high throughput.
- *
- * We are now left only with explaining the additional
- * compound condition that is checked below for deciding
- * whether the scenario is asymmetric. To explain this
- * compound condition, we need to add that the function
- * bfq_asymmetric_scenario checks the weights of only
- * non-weight-raised queues, for efficiency reasons (see
- * comments on bfq_weights_tree_add()). Then the fact that
- * bfqq is weight-raised is checked explicitly here. More
- * precisely, the compound condition below takes into account
- * also the fact that, even if bfqq is being weight-raised,
- * the scenario is still symmetric if all queues with requests
- * waiting for completion happen to be
- * weight-raised. Actually, we should be even more precise
- * here, and differentiate between interactive weight raising
- * and soft real-time weight raising.
- *
- * As a side note, it is worth considering that the above
- * device-idling countermeasures may however fail in the
- * following unlucky scenario: if idling is (correctly)
- * disabled in a time period during which all symmetry
- * sub-conditions hold, and hence the device is allowed to
- * enqueue many requests, but at some later point in time some
- * sub-condition stops to hold, then it may become impossible
- * to let requests be served in the desired order until all
- * the requests already queued in the device have been served.
- */
-static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
- struct bfq_queue *bfqq)
-{
- return (bfqq->wr_coeff > 1 &&
- bfqd->wr_busy_queues <
- bfq_tot_busy_queues(bfqd)) ||
- bfq_asymmetric_scenario(bfqd, bfqq);
-}
-
-/*
* For a queue that becomes empty, device idling is allowed only if
* this function returns true for that queue. As a consequence, since
* device idling plays a critical role for both throughput boosting
(bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
struct bfq_queue *async_bfqq =
bfqq->bic && bfqq->bic->bfqq[0] &&
- bfq_bfqq_busy(bfqq->bic->bfqq[0]) ?
+ bfq_bfqq_busy(bfqq->bic->bfqq[0]) &&
+ bfqq->bic->bfqq[0]->next_rq ?
bfqq->bic->bfqq[0] : NULL;
/*
- * If the process associated with bfqq has also async
- * I/O pending, then inject it
- * unconditionally. Injecting I/O from the same
- * process can cause no harm to the process. On the
- * contrary, it can only increase bandwidth and reduce
- * latency for the process.
+ * The next three mutually-exclusive ifs decide
+ * whether to try injection, and choose the queue to
+ * pick an I/O request from.
+ *
+ * The first if checks whether the process associated
+ * with bfqq has also async I/O pending. If so, it
+ * injects such I/O unconditionally. Injecting async
+ * I/O from the same process can cause no harm to the
+ * process. On the contrary, it can only increase
+ * bandwidth and reduce latency for the process.
+ *
+ * The second if checks whether there happens to be a
+ * non-empty waker queue for bfqq, i.e., a queue whose
+ * I/O needs to be completed for bfqq to receive new
+ * I/O. This happens, e.g., if bfqq is associated with
+ * a process that does some sync. A sync generates
+ * extra blocking I/O, which must be completed before
+ * the process associated with bfqq can go on with its
+ * I/O. If the I/O of the waker queue is not served,
+ * then bfqq remains empty, and no I/O is dispatched,
+ * until the idle timeout fires for bfqq. This is
+ * likely to result in lower bandwidth and higher
+ * latencies for bfqq, and in a severe loss of total
+ * throughput. The best action to take is therefore to
+ * serve the waker queue as soon as possible. So do it
+ * (without relying on the third alternative below for
+ * eventually serving waker_bfqq's I/O; see the last
+ * paragraph for further details). This systematic
+ * injection of I/O from the waker queue does not
+ * cause any delay to bfqq's I/O. On the contrary,
+ * next bfqq's I/O is brought forward dramatically,
+ * for it is not blocked for milliseconds.
+ *
+ * The third if checks whether bfqq is a queue for
+ * which it is better to avoid injection. It is so if
+ * bfqq delivers more throughput when served without
+ * any further I/O from other queues in the middle, or
+ * if the service times of bfqq's I/O requests both
+ * count more than overall throughput, and may be
+ * easily increased by injection (this happens if bfqq
+ * has a short think time). If none of these
+ * conditions holds, then a candidate queue for
+ * injection is looked for through
+ * bfq_choose_bfqq_for_injection(). Note that the
+ * latter may return NULL (for example if the inject
+ * limit for bfqq is currently 0).
+ *
+ * NOTE: motivation for the second alternative
+ *
+ * Thanks to the way the inject limit is updated in
+ * bfq_update_has_short_ttime(), it is rather likely
+ * that, if I/O is being plugged for bfqq and the
+ * waker queue has pending I/O requests that are
+ * blocking bfqq's I/O, then the third alternative
+ * above lets the waker queue get served before the
+ * I/O-plugging timeout fires. So one may deem the
+ * second alternative superfluous. It is not, because
+ * the third alternative may be way less effective in
+ * case of a synchronization. For two main
+ * reasons. First, throughput may be low because the
+ * inject limit may be too low to guarantee the same
+ * amount of injected I/O, from the waker queue or
+ * other queues, that the second alternative
+ * guarantees (the second alternative unconditionally
+ * injects a pending I/O request of the waker queue
+ * for each bfq_dispatch_request()). Second, with the
+ * third alternative, the duration of the plugging,
+ * i.e., the time before bfqq finally receives new I/O,
+ * may not be minimized, because the waker queue may
+ * happen to be served only after other queues.
*/
if (async_bfqq &&
icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
bfq_bfqq_budget_left(async_bfqq))
bfqq = bfqq->bic->bfqq[0];
+ else if (bfq_bfqq_has_waker(bfqq) &&
+ bfq_bfqq_busy(bfqq->waker_bfqq) &&
+ bfqq->next_rq &&
+ bfq_serv_to_charge(bfqq->waker_bfqq->next_rq,
+ bfqq->waker_bfqq) <=
+ bfq_bfqq_budget_left(bfqq->waker_bfqq)
+ )
+ bfqq = bfqq->waker_bfqq;
else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
(bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
!bfq_bfqq_has_short_ttime(bfqq)))
return rq;
}
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
static void bfq_update_dispatch_stats(struct request_queue *q,
struct request *rq,
struct bfq_queue *in_serv_queue,
struct request *rq,
struct bfq_queue *in_serv_queue,
bool idle_timer_disabled) {}
-#endif
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
+ struct bfq_queue *item;
+ struct hlist_node *n;
+
if (bfqq == bfqd->in_service_queue) {
- __bfq_bfqq_expire(bfqd, bfqq);
+ __bfq_bfqq_expire(bfqd, bfqq, BFQQE_BUDGET_TIMEOUT);
bfq_schedule_dispatch(bfqd);
}
bfq_put_cooperator(bfqq);
+ /* remove bfqq from woken list */
+ if (!hlist_unhashed(&bfqq->woken_list_node))
+ hlist_del_init(&bfqq->woken_list_node);
+
+ /* reset waker for all queues in woken list */
+ hlist_for_each_entry_safe(item, n, &bfqq->woken_list,
+ woken_list_node) {
+ item->waker_bfqq = NULL;
+ bfq_clear_bfqq_has_waker(item);
+ hlist_del_init(&item->woken_list_node);
+ }
+
bfq_put_queue(bfqq); /* release process reference */
}
unsigned long flags;
spin_lock_irqsave(&bfqd->lock, flags);
+ bfqq->bic = NULL;
bfq_exit_bfqq(bfqd, bfqq);
bic_set_bfqq(bic, NULL, is_sync);
spin_unlock_irqrestore(&bfqd->lock, flags);
RB_CLEAR_NODE(&bfqq->entity.rb_node);
INIT_LIST_HEAD(&bfqq->fifo);
INIT_HLIST_NODE(&bfqq->burst_list_node);
+ INIT_HLIST_NODE(&bfqq->woken_list_node);
+ INIT_HLIST_HEAD(&bfqq->woken_list);
bfqq->ref = 0;
bfqq->bfqd = bfqd;
struct bfq_queue *bfqq,
struct bfq_io_cq *bic)
{
- bool has_short_ttime = true;
+ bool has_short_ttime = true, state_changed;
/*
* No need to update has_short_ttime if bfqq is async or in
bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
has_short_ttime = false;
- bfq_log_bfqq(bfqd, bfqq, "update_has_short_ttime: has_short_ttime %d",
- has_short_ttime);
+ state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);
if (has_short_ttime)
bfq_mark_bfqq_has_short_ttime(bfqq);
else
bfq_clear_bfqq_has_short_ttime(bfqq);
+
+ /*
+ * Until the base value for the total service time gets
+ * finally computed for bfqq, the inject limit does depend on
+ * the think-time state (short|long). In particular, the limit
+ * is 0 or 1 if the think time is deemed, respectively, as
+ * short or long (details in the comments in
+ * bfq_update_inject_limit()). Accordingly, the next
+ * instructions reset the inject limit if the think-time state
+ * has changed and the above base value is still to be
+ * computed.
+ *
+ * However, the reset is performed only if more than 100 ms
+ * have elapsed since the last update of the inject limit, or
+ * (inclusive) if the change is from short to long think
+ * time. The reason for this waiting is as follows.
+ *
+ * bfqq may have a long think time because of a
+ * synchronization with some other queue, i.e., because the
+ * I/O of some other queue may need to be completed for bfqq
+ * to receive new I/O. Details in the comments on the choice
+ * of the queue for injection in bfq_select_queue().
+ *
+ * As stressed in those comments, if such a synchronization is
+ * actually in place, then, without injection on bfqq, the
+ * blocking I/O cannot happen to served while bfqq is in
+ * service. As a consequence, if bfqq is granted
+ * I/O-dispatch-plugging, then bfqq remains empty, and no I/O
+ * is dispatched, until the idle timeout fires. This is likely
+ * to result in lower bandwidth and higher latencies for bfqq,
+ * and in a severe loss of total throughput.
+ *
+ * On the opposite end, a non-zero inject limit may allow the
+ * I/O that blocks bfqq to be executed soon, and therefore
+ * bfqq to receive new I/O soon.
+ *
+ * But, if the blocking gets actually eliminated, then the
+ * next think-time sample for bfqq may be very low. This in
+ * turn may cause bfqq's think time to be deemed
+ * short. Without the 100 ms barrier, this new state change
+ * would cause the body of the next if to be executed
+ * immediately. But this would set to 0 the inject
+ * limit. Without injection, the blocking I/O would cause the
+ * think time of bfqq to become long again, and therefore the
+ * inject limit to be raised again, and so on. The only effect
+ * of such a steady oscillation between the two think-time
+ * states would be to prevent effective injection on bfqq.
+ *
+ * In contrast, if the inject limit is not reset during such a
+ * long time interval as 100 ms, then the number of short
+ * think time samples can grow significantly before the reset
+ * is performed. As a consequence, the think time state can
+ * become stable before the reset. Therefore there will be no
+ * state change when the 100 ms elapse, and no reset of the
+ * inject limit. The inject limit remains steadily equal to 1
+ * both during and after the 100 ms. So injection can be
+ * performed at all times, and throughput gets boosted.
+ *
+ * An inject limit equal to 1 is however in conflict, in
+ * general, with the fact that the think time of bfqq is
+ * short, because injection may be likely to delay bfqq's I/O
+ * (as explained in the comments in
+ * bfq_update_inject_limit()). But this does not happen in
+ * this special case, because bfqq's low think time is due to
+ * an effective handling of a synchronization, through
+ * injection. In this special case, bfqq's I/O does not get
+ * delayed by injection; on the contrary, bfqq's I/O is
+ * brought forward, because it is not blocked for
+ * milliseconds.
+ *
+ * In addition, serving the blocking I/O much sooner, and much
+ * more frequently than once per I/O-plugging timeout, makes
+ * it much quicker to detect a waker queue (the concept of
+ * waker queue is defined in the comments in
+ * bfq_add_request()). This makes it possible to start sooner
+ * to boost throughput more effectively, by injecting the I/O
+ * of the waker queue unconditionally on every
+ * bfq_dispatch_request().
+ *
+ * One last, important benefit of not resetting the inject
+ * limit before 100 ms is that, during this time interval, the
+ * base value for the total service time is likely to get
+ * finally computed for bfqq, freeing the inject limit from
+ * its relation with the think time.
+ */
+ if (state_changed && bfqq->last_serv_time_ns == 0 &&
+ (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(100)) ||
+ !has_short_ttime))
+ bfq_reset_inject_limit(bfqd, bfqq);
}
/*
static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct request *rq)
{
- struct bfq_io_cq *bic = RQ_BIC(rq);
-
if (rq->cmd_flags & REQ_META)
bfqq->meta_pending++;
- bfq_update_io_thinktime(bfqd, bfqq);
- bfq_update_has_short_ttime(bfqd, bfqq, bic);
- bfq_update_io_seektime(bfqd, bfqq, rq);
-
- bfq_log_bfqq(bfqd, bfqq,
- "rq_enqueued: has_short_ttime=%d (seeky %d)",
- bfq_bfqq_has_short_ttime(bfqq), BFQQ_SEEKY(bfqq));
-
bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
bfqq = new_bfqq;
}
+ bfq_update_io_thinktime(bfqd, bfqq);
+ bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));
+ bfq_update_io_seektime(bfqd, bfqq, rq);
+
waiting = bfqq && bfq_bfqq_wait_request(bfqq);
bfq_add_request(rq);
idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
return idle_timer_disabled;
}
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
static void bfq_update_insert_stats(struct request_queue *q,
struct bfq_queue *bfqq,
bool idle_timer_disabled,
struct bfq_queue *bfqq,
bool idle_timer_disabled,
unsigned int cmd_flags) {}
-#endif
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
bool at_head)
1UL<<(BFQ_RATE_SHIFT - 10))
bfq_update_rate_reset(bfqd, NULL);
bfqd->last_completion = now_ns;
+ bfqd->last_completed_rq_bfqq = bfqq;
/*
* If we are waiting to discover whether the request pattern
* total service time, and there seem to be the right
* conditions to do it, or we can lower the last base value
* computed.
+ *
+ * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
+ * request in flight, because this function is in the code
+ * path that handles the completion of a request of bfqq, and,
+ * in particular, this function is executed before
+ * bfqd->rq_in_driver is decremented in such a code path.
*/
- if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 0) ||
+ if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
tot_time_ns < bfqq->last_serv_time_ns) {
bfqq->last_serv_time_ns = tot_time_ns;
/*
* start trying injection.
*/
bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
- }
+ } else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
+ /*
+ * No I/O injected and no request still in service in
+ * the drive: these are the exact conditions for
+ * computing the base value of the total service time
+ * for bfqq. So let's update this value, because it is
+ * rather variable. For example, it varies if the size
+ * or the spatial locality of the I/O requests in bfqq
+ * change.
+ */
+ bfqq->last_serv_time_ns = tot_time_ns;
+
/* update complete, not waiting for any request completion any longer */
bfqd->waited_rq = NULL;
/* max service rate measured so far */
u32 max_service_rate;
+
+ /*
+ * Pointer to the waker queue for this queue, i.e., to the
+ * queue Q such that this queue happens to get new I/O right
+ * after some I/O request of Q is completed. For details, see
+ * the comments on the choice of the queue for injection in
+ * bfq_select_queue().
+ */
+ struct bfq_queue *waker_bfqq;
+ /* node for woken_list, see below */
+ struct hlist_node woken_list_node;
+ /*
+ * Head of the list of the woken queues for this queue, i.e.,
+ * of the list of the queues for which this queue is a waker
+ * queue. This list is used to reset the waker_bfqq pointer in
+ * the woken queues when this queue exits.
+ */
+ struct hlist_head woken_list;
};
/**
/* time of last request completion (ns) */
u64 last_completion;
+ /* bfqq owning the last completed rq */
+ struct bfq_queue *last_completed_rq_bfqq;
+
/* time of last transition from empty to non-empty (ns) */
u64 last_empty_occupied_ns;
* update
*/
BFQQF_coop, /* bfqq is shared */
- BFQQF_split_coop /* shared bfqq will be split */
+ BFQQF_split_coop, /* shared bfqq will be split */
+ BFQQF_has_waker /* bfqq has a waker queue */
};
#define BFQ_BFQQ_FNS(name) \
BFQ_BFQQ_FNS(coop);
BFQ_BFQQ_FNS(split_coop);
BFQ_BFQQ_FNS(softrt_update);
+BFQ_BFQQ_FNS(has_waker);
#undef BFQ_BFQQ_FNS
/* Expiration reasons. */
BFQQE_PREEMPTED /* preemption in progress */
};
+struct bfq_stat {
+ struct percpu_counter cpu_cnt;
+ atomic64_t aux_cnt;
+};
+
struct bfqg_stats {
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
/* number of ios merged */
struct blkg_rwstat merged;
/* total time spent on device in ns, may not be accurate w/ queueing */
/* number of IOs queued up */
struct blkg_rwstat queued;
/* total disk time and nr sectors dispatched by this group */
- struct blkg_stat time;
+ struct bfq_stat time;
/* sum of number of ios queued across all samples */
- struct blkg_stat avg_queue_size_sum;
+ struct bfq_stat avg_queue_size_sum;
/* count of samples taken for average */
- struct blkg_stat avg_queue_size_samples;
+ struct bfq_stat avg_queue_size_samples;
/* how many times this group has been removed from service tree */
- struct blkg_stat dequeue;
+ struct bfq_stat dequeue;
/* total time spent waiting for it to be assigned a timeslice. */
- struct blkg_stat group_wait_time;
+ struct bfq_stat group_wait_time;
/* time spent idling for this blkcg_gq */
- struct blkg_stat idle_time;
+ struct bfq_stat idle_time;
/* total time with empty current active q with other requests queued */
- struct blkg_stat empty_time;
+ struct bfq_stat empty_time;
/* fields after this shouldn't be cleared on stat reset */
u64 start_group_wait_time;
u64 start_idle_time;
u64 start_empty_time;
uint16_t flags;
-#endif /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
};
#ifdef CONFIG_BFQ_GROUP_IOSCHED
}
EXPORT_SYMBOL(bio_put);
-int bio_phys_segments(struct request_queue *q, struct bio *bio)
-{
- if (unlikely(!bio_flagged(bio, BIO_SEG_VALID)))
- blk_recount_segments(q, bio);
-
- return bio->bi_phys_segments;
-}
-
/**
* __bio_clone_fast - clone a bio that shares the original bio's biovec
* @bio: destination bio
}
}
- if (bio_full(bio))
+ if (bio_full(bio, len))
return 0;
- if (bio->bi_phys_segments >= queue_max_segments(q))
+ if (bio->bi_vcnt >= queue_max_segments(q))
return 0;
bvec = &bio->bi_io_vec[bio->bi_vcnt];
bio->bi_vcnt++;
done:
bio->bi_iter.bi_size += len;
- bio->bi_phys_segments = bio->bi_vcnt;
- bio_set_flag(bio, BIO_SEG_VALID);
return len;
}
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
- WARN_ON_ONCE(bio_full(bio));
+ WARN_ON_ONCE(bio_full(bio, len));
bv->bv_page = page;
bv->bv_offset = off;
bool same_page = false;
if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
- if (bio_full(bio))
+ if (bio_full(bio, len))
return 0;
__bio_add_page(bio, page, len, offset);
}
}
EXPORT_SYMBOL(bio_add_page);
-static void bio_get_pages(struct bio *bio)
+void bio_release_pages(struct bio *bio, bool mark_dirty)
{
struct bvec_iter_all iter_all;
struct bio_vec *bvec;
- bio_for_each_segment_all(bvec, bio, iter_all)
- get_page(bvec->bv_page);
-}
-
-static void bio_release_pages(struct bio *bio)
-{
- struct bvec_iter_all iter_all;
- struct bio_vec *bvec;
+ if (bio_flagged(bio, BIO_NO_PAGE_REF))
+ return;
- bio_for_each_segment_all(bvec, bio, iter_all)
+ bio_for_each_segment_all(bvec, bio, iter_all) {
+ if (mark_dirty && !PageCompound(bvec->bv_page))
+ set_page_dirty_lock(bvec->bv_page);
put_page(bvec->bv_page);
+ }
}
static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
if (same_page)
put_page(page);
} else {
- if (WARN_ON_ONCE(bio_full(bio)))
+ if (WARN_ON_ONCE(bio_full(bio, len)))
return -EINVAL;
__bio_add_page(bio, page, len, offset);
}
ret = __bio_iov_bvec_add_pages(bio, iter);
else
ret = __bio_iov_iter_get_pages(bio, iter);
- } while (!ret && iov_iter_count(iter) && !bio_full(bio));
+ } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
- if (iov_iter_bvec_no_ref(iter))
+ if (is_bvec)
bio_set_flag(bio, BIO_NO_PAGE_REF);
- else if (is_bvec)
- bio_get_pages(bio);
-
return bio->bi_vcnt ? 0 : ret;
}
if (data->nr_segs > UIO_MAXIOV)
return NULL;
- bmd = kmalloc(sizeof(struct bio_map_data) +
- sizeof(struct iovec) * data->nr_segs, gfp_mask);
+ bmd = kmalloc(struct_size(bmd, iov, data->nr_segs), gfp_mask);
if (!bmd)
return NULL;
memcpy(bmd->iov, data->iov, sizeof(struct iovec) * data->nr_segs);
int j;
struct bio *bio;
int ret;
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
return bio;
out_unmap:
- bio_for_each_segment_all(bvec, bio, iter_all) {
- put_page(bvec->bv_page);
- }
+ bio_release_pages(bio, false);
bio_put(bio);
return ERR_PTR(ret);
}
-static void __bio_unmap_user(struct bio *bio)
-{
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
-
- /*
- * make sure we dirty pages we wrote to
- */
- bio_for_each_segment_all(bvec, bio, iter_all) {
- if (bio_data_dir(bio) == READ)
- set_page_dirty_lock(bvec->bv_page);
-
- put_page(bvec->bv_page);
- }
-
- bio_put(bio);
-}
-
/**
* bio_unmap_user - unmap a bio
* @bio: the bio being unmapped
*/
void bio_unmap_user(struct bio *bio)
{
- __bio_unmap_user(bio);
+ bio_release_pages(bio, bio_data_dir(bio) == READ);
+ bio_put(bio);
bio_put(bio);
}
while ((bio = next) != NULL) {
next = bio->bi_private;
- bio_set_pages_dirty(bio);
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
- bio_release_pages(bio);
+ bio_release_pages(bio, true);
bio_put(bio);
}
}
goto defer;
}
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
- bio_release_pages(bio);
+ bio_release_pages(bio, false);
bio_put(bio);
return;
defer:
}
EXPORT_SYMBOL(generic_end_io_acct);
-#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-void bio_flush_dcache_pages(struct bio *bi)
-{
- struct bio_vec bvec;
- struct bvec_iter iter;
-
- bio_for_each_segment(bvec, bi, iter)
- flush_dcache_page(bvec.bv_page);
-}
-EXPORT_SYMBOL(bio_flush_dcache_pages);
-#endif
-
static inline bool bio_remaining_done(struct bio *bio)
{
/*
if (offset == 0 && size == bio->bi_iter.bi_size)
return;
- bio_clear_flag(bio, BIO_SEG_VALID);
-
bio_advance(bio, offset << 9);
-
bio->bi_iter.bi_size = size;
if (bio_integrity(bio))
blkg_rwstat_exit(&blkg->stat_ios);
blkg_rwstat_exit(&blkg->stat_bytes);
+ percpu_ref_exit(&blkg->refcnt);
kfree(blkg);
}
{
struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
- percpu_ref_exit(&blkg->refcnt);
-
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
if (blkg->parent)
if (!blkg)
return NULL;
+ if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
+ goto err_free;
+
if (blkg_rwstat_init(&blkg->stat_bytes, gfp_mask) ||
blkg_rwstat_init(&blkg->stat_ios, gfp_mask))
goto err_free;
blkg_get(blkg->parent);
}
- ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0,
- GFP_NOWAIT | __GFP_NOWARN);
- if (ret)
- goto err_cancel_ref;
-
/* invoke per-policy init */
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
blkg_put(blkg);
return ERR_PTR(ret);
-err_cancel_ref:
- percpu_ref_exit(&blkg->refcnt);
err_put_congested:
wb_congested_put(wb_congested);
err_put_css:
* Print @rwstat to @sf for the device assocaited with @pd.
*/
u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
- const struct blkg_rwstat *rwstat)
+ const struct blkg_rwstat_sample *rwstat)
{
static const char *rwstr[] = {
[BLKG_RWSTAT_READ] = "Read",
for (i = 0; i < BLKG_RWSTAT_NR; i++)
seq_printf(sf, "%s %s %llu\n", dname, rwstr[i],
- (unsigned long long)atomic64_read(&rwstat->aux_cnt[i]));
+ rwstat->cnt[i]);
- v = atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_READ]) +
- atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_WRITE]) +
- atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_DISCARD]);
- seq_printf(sf, "%s Total %llu\n", dname, (unsigned long long)v);
+ v = rwstat->cnt[BLKG_RWSTAT_READ] +
+ rwstat->cnt[BLKG_RWSTAT_WRITE] +
+ rwstat->cnt[BLKG_RWSTAT_DISCARD];
+ seq_printf(sf, "%s Total %llu\n", dname, v);
return v;
}
EXPORT_SYMBOL_GPL(__blkg_prfill_rwstat);
/**
- * blkg_prfill_stat - prfill callback for blkg_stat
- * @sf: seq_file to print to
- * @pd: policy private data of interest
- * @off: offset to the blkg_stat in @pd
- *
- * prfill callback for printing a blkg_stat.
- */
-u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off)
-{
- return __blkg_prfill_u64(sf, pd, blkg_stat_read((void *)pd + off));
-}
-EXPORT_SYMBOL_GPL(blkg_prfill_stat);
-
-/**
* blkg_prfill_rwstat - prfill callback for blkg_rwstat
* @sf: seq_file to print to
* @pd: policy private data of interest
u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
- struct blkg_rwstat rwstat = blkg_rwstat_read((void *)pd + off);
+ struct blkg_rwstat_sample rwstat = { };
+ blkg_rwstat_read((void *)pd + off, &rwstat);
return __blkg_prfill_rwstat(sf, pd, &rwstat);
}
EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
static u64 blkg_prfill_rwstat_field(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
- struct blkg_rwstat rwstat = blkg_rwstat_read((void *)pd->blkg + off);
+ struct blkg_rwstat_sample rwstat = { };
+ blkg_rwstat_read((void *)pd->blkg + off, &rwstat);
return __blkg_prfill_rwstat(sf, pd, &rwstat);
}
struct blkg_policy_data *pd,
int off)
{
- struct blkg_rwstat rwstat = blkg_rwstat_recursive_sum(pd->blkg,
- NULL, off);
+ struct blkg_rwstat_sample rwstat;
+
+ blkg_rwstat_recursive_sum(pd->blkg, NULL, off, &rwstat);
return __blkg_prfill_rwstat(sf, pd, &rwstat);
}
EXPORT_SYMBOL_GPL(blkg_print_stat_ios_recursive);
/**
- * blkg_stat_recursive_sum - collect hierarchical blkg_stat
- * @blkg: blkg of interest
- * @pol: blkcg_policy which contains the blkg_stat
- * @off: offset to the blkg_stat in blkg_policy_data or @blkg
- *
- * Collect the blkg_stat specified by @blkg, @pol and @off and all its
- * online descendants and their aux counts. The caller must be holding the
- * queue lock for online tests.
- *
- * If @pol is NULL, blkg_stat is at @off bytes into @blkg; otherwise, it is
- * at @off bytes into @blkg's blkg_policy_data of the policy.
- */
-u64 blkg_stat_recursive_sum(struct blkcg_gq *blkg,
- struct blkcg_policy *pol, int off)
-{
- struct blkcg_gq *pos_blkg;
- struct cgroup_subsys_state *pos_css;
- u64 sum = 0;
-
- lockdep_assert_held(&blkg->q->queue_lock);
-
- rcu_read_lock();
- blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
- struct blkg_stat *stat;
-
- if (!pos_blkg->online)
- continue;
-
- if (pol)
- stat = (void *)blkg_to_pd(pos_blkg, pol) + off;
- else
- stat = (void *)blkg + off;
-
- sum += blkg_stat_read(stat) + atomic64_read(&stat->aux_cnt);
- }
- rcu_read_unlock();
-
- return sum;
-}
-EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum);
-
-/**
* blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat
* @blkg: blkg of interest
* @pol: blkcg_policy which contains the blkg_rwstat
* @off: offset to the blkg_rwstat in blkg_policy_data or @blkg
+ * @sum: blkg_rwstat_sample structure containing the results
*
* Collect the blkg_rwstat specified by @blkg, @pol and @off and all its
* online descendants and their aux counts. The caller must be holding the
* If @pol is NULL, blkg_rwstat is at @off bytes into @blkg; otherwise, it
* is at @off bytes into @blkg's blkg_policy_data of the policy.
*/
-struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
- struct blkcg_policy *pol, int off)
+void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol,
+ int off, struct blkg_rwstat_sample *sum)
{
struct blkcg_gq *pos_blkg;
struct cgroup_subsys_state *pos_css;
- struct blkg_rwstat sum = { };
- int i;
+ unsigned int i;
lockdep_assert_held(&blkg->q->queue_lock);
rwstat = (void *)pos_blkg + off;
for (i = 0; i < BLKG_RWSTAT_NR; i++)
- atomic64_add(atomic64_read(&rwstat->aux_cnt[i]) +
- percpu_counter_sum_positive(&rwstat->cpu_cnt[i]),
- &sum.aux_cnt[i]);
+ sum->cnt[i] = blkg_rwstat_read_counter(rwstat, i);
}
rcu_read_unlock();
-
- return sum;
}
EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
const char *dname;
char *buf;
- struct blkg_rwstat rwstat;
+ struct blkg_rwstat_sample rwstat;
u64 rbytes, wbytes, rios, wios, dbytes, dios;
size_t size = seq_get_buf(sf, &buf), off = 0;
int i;
spin_lock_irq(&blkg->q->queue_lock);
- rwstat = blkg_rwstat_recursive_sum(blkg, NULL,
- offsetof(struct blkcg_gq, stat_bytes));
- rbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_READ]);
- wbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_WRITE]);
- dbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_DISCARD]);
+ blkg_rwstat_recursive_sum(blkg, NULL,
+ offsetof(struct blkcg_gq, stat_bytes), &rwstat);
+ rbytes = rwstat.cnt[BLKG_RWSTAT_READ];
+ wbytes = rwstat.cnt[BLKG_RWSTAT_WRITE];
+ dbytes = rwstat.cnt[BLKG_RWSTAT_DISCARD];
- rwstat = blkg_rwstat_recursive_sum(blkg, NULL,
- offsetof(struct blkcg_gq, stat_ios));
- rios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_READ]);
- wios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_WRITE]);
- dios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_DISCARD]);
+ blkg_rwstat_recursive_sum(blkg, NULL,
+ offsetof(struct blkcg_gq, stat_ios), &rwstat);
+ rios = rwstat.cnt[BLKG_RWSTAT_READ];
+ wios = rwstat.cnt[BLKG_RWSTAT_WRITE];
+ dios = rwstat.cnt[BLKG_RWSTAT_DISCARD];
spin_unlock_irq(&blkg->q->queue_lock);
}
next:
if (has_stats) {
- off += scnprintf(buf+off, size-off, "\n");
- seq_commit(sf, off);
+ if (off < size - 1) {
+ off += scnprintf(buf+off, size-off, "\n");
+ seq_commit(sf, off);
+ } else {
+ seq_commit(sf, -1);
+ }
}
}
spin_lock_irq(&q->queue_lock);
- list_for_each_entry(blkg, &q->blkg_list, q_node) {
+ /* blkg_list is pushed at the head, reverse walk to init parents first */
+ list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
struct blkg_policy_data *pd;
if (blkg->pd[pol->plid])
}
EXPORT_SYMBOL(blk_rq_init);
+#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
+static const char *const blk_op_name[] = {
+ REQ_OP_NAME(READ),
+ REQ_OP_NAME(WRITE),
+ REQ_OP_NAME(FLUSH),
+ REQ_OP_NAME(DISCARD),
+ REQ_OP_NAME(SECURE_ERASE),
+ REQ_OP_NAME(ZONE_RESET),
+ REQ_OP_NAME(WRITE_SAME),
+ REQ_OP_NAME(WRITE_ZEROES),
+ REQ_OP_NAME(SCSI_IN),
+ REQ_OP_NAME(SCSI_OUT),
+ REQ_OP_NAME(DRV_IN),
+ REQ_OP_NAME(DRV_OUT),
+};
+#undef REQ_OP_NAME
+
+/**
+ * blk_op_str - Return string XXX in the REQ_OP_XXX.
+ * @op: REQ_OP_XXX.
+ *
+ * Description: Centralize block layer function to convert REQ_OP_XXX into
+ * string format. Useful in the debugging and tracing bio or request. For
+ * invalid REQ_OP_XXX it returns string "UNKNOWN".
+ */
+inline const char *blk_op_str(unsigned int op)
+{
+ const char *op_str = "UNKNOWN";
+
+ if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
+ op_str = blk_op_name[op];
+
+ return op_str;
+}
+EXPORT_SYMBOL_GPL(blk_op_str);
+
static const struct {
int errno;
const char *name;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
-static void print_req_error(struct request *req, blk_status_t status)
+static void print_req_error(struct request *req, blk_status_t status,
+ const char *caller)
{
int idx = (__force int)status;
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
return;
- printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector %llu flags %x\n",
- __func__, blk_errors[idx].name,
- req->rq_disk ? req->rq_disk->disk_name : "?",
- (unsigned long long)blk_rq_pos(req),
- req->cmd_flags);
+ printk_ratelimited(KERN_ERR
+ "%s: %s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x "
+ "phys_seg %u prio class %u\n",
+ caller, blk_errors[idx].name,
+ req->rq_disk ? req->rq_disk->disk_name : "?",
+ blk_rq_pos(req), req_op(req), blk_op_str(req_op(req)),
+ req->cmd_flags & ~REQ_OP_MASK,
+ req->nr_phys_segments,
+ IOPRIO_PRIO_CLASS(req->ioprio));
}
static void req_bio_endio(struct request *rq, struct bio *bio,
}
EXPORT_SYMBOL(blk_put_request);
-bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
- struct bio *bio)
+bool bio_attempt_back_merge(struct request *req, struct bio *bio,
+ unsigned int nr_segs)
{
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
- if (!ll_back_merge_fn(q, req, bio))
+ if (!ll_back_merge_fn(req, bio, nr_segs))
return false;
- trace_block_bio_backmerge(q, req, bio);
+ trace_block_bio_backmerge(req->q, req, bio);
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
blk_rq_set_mixed_merge(req);
return true;
}
-bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
- struct bio *bio)
+bool bio_attempt_front_merge(struct request *req, struct bio *bio,
+ unsigned int nr_segs)
{
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
- if (!ll_front_merge_fn(q, req, bio))
+ if (!ll_front_merge_fn(req, bio, nr_segs))
return false;
- trace_block_bio_frontmerge(q, req, bio);
+ trace_block_bio_frontmerge(req->q, req, bio);
if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
blk_rq_set_mixed_merge(req);
* blk_attempt_plug_merge - try to merge with %current's plugged list
* @q: request_queue new bio is being queued at
* @bio: new bio being queued
+ * @nr_segs: number of segments in @bio
* @same_queue_rq: pointer to &struct request that gets filled in when
* another request associated with @q is found on the plug list
* (optional, may be %NULL)
* Caller must ensure !blk_queue_nomerges(q) beforehand.
*/
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
- struct request **same_queue_rq)
+ unsigned int nr_segs, struct request **same_queue_rq)
{
struct blk_plug *plug;
struct request *rq;
switch (blk_try_merge(rq, bio)) {
case ELEVATOR_BACK_MERGE:
- merged = bio_attempt_back_merge(q, rq, bio);
+ merged = bio_attempt_back_merge(rq, bio, nr_segs);
break;
case ELEVATOR_FRONT_MERGE:
- merged = bio_attempt_front_merge(q, rq, bio);
+ merged = bio_attempt_front_merge(rq, bio, nr_segs);
break;
case ELEVATOR_DISCARD_MERGE:
merged = bio_attempt_discard_merge(q, rq, bio);
return false;
}
-void blk_init_request_from_bio(struct request *req, struct bio *bio)
-{
- if (bio->bi_opf & REQ_RAHEAD)
- req->cmd_flags |= REQ_FAILFAST_MASK;
-
- req->__sector = bio->bi_iter.bi_sector;
- req->ioprio = bio_prio(bio);
- req->write_hint = bio->bi_write_hint;
- blk_rq_bio_prep(req->q, req, bio);
-}
-EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
-
static void handle_bad_sector(struct bio *bio, sector_t maxsector)
{
char b[BDEVNAME_SIZE];
* Recalculate it to check the request correctly on this queue's
* limitation.
*/
- blk_recalc_rq_segments(rq);
+ rq->nr_phys_segments = blk_recalc_rq_segments(rq);
if (rq->nr_phys_segments > queue_max_segments(q)) {
printk(KERN_ERR "%s: over max segments limit. (%hu > %hu)\n",
__func__, rq->nr_phys_segments, queue_max_segments(q));
*
* This special helper function is only for request stacking drivers
* (e.g. request-based dm) so that they can handle partial completion.
- * Actual device drivers should use blk_end_request instead.
+ * Actual device drivers should use blk_mq_end_request instead.
*
* Passing the result of blk_rq_bytes() as @nr_bytes guarantees
* %false return from this function.
if (unlikely(error && !blk_rq_is_passthrough(req) &&
!(req->rq_flags & RQF_QUIET)))
- print_req_error(req, error);
+ print_req_error(req, error, __func__);
blk_account_io_completion(req, nr_bytes);
}
/* recalculate the number of segments */
- blk_recalc_rq_segments(req);
+ req->nr_phys_segments = blk_recalc_rq_segments(req);
}
return true;
}
EXPORT_SYMBOL_GPL(blk_update_request);
-void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
- struct bio *bio)
-{
- if (bio_has_data(bio))
- rq->nr_phys_segments = bio_phys_segments(q, bio);
- else if (bio_op(bio) == REQ_OP_DISCARD)
- rq->nr_phys_segments = 1;
-
- rq->__data_len = bio->bi_iter.bi_size;
- rq->bio = rq->biotail = bio;
-
- if (bio->bi_disk)
- rq->rq_disk = bio->bi_disk;
-}
-
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
/**
* rq_flush_dcache_pages - Helper function to flush all pages in a request
inflight = atomic_dec_return(&rqw->inflight);
WARN_ON_ONCE(inflight < 0);
- if (iolat->min_lat_nsec == 0)
- goto next;
- iolatency_record_time(iolat, &bio->bi_issue, now,
- issue_as_root);
- window_start = atomic64_read(&iolat->window_start);
- if (now > window_start &&
- (now - window_start) >= iolat->cur_win_nsec) {
- if (atomic64_cmpxchg(&iolat->window_start,
- window_start, now) == window_start)
- iolatency_check_latencies(iolat, now);
+ /*
+ * If bi_status is BLK_STS_AGAIN, the bio wasn't actually
+ * submitted, so do not account for it.
+ */
+ if (iolat->min_lat_nsec && bio->bi_status != BLK_STS_AGAIN) {
+ iolatency_record_time(iolat, &bio->bi_issue, now,
+ issue_as_root);
+ window_start = atomic64_read(&iolat->window_start);
+ if (now > window_start &&
+ (now - window_start) >= iolat->cur_win_nsec) {
+ if (atomic64_cmpxchg(&iolat->window_start,
+ window_start, now) == window_start)
+ iolatency_check_latencies(iolat, now);
+ }
}
-next:
wake_up(&rqw->wait);
blkg = blkg->parent;
}
}
-static void blkcg_iolatency_cleanup(struct rq_qos *rqos, struct bio *bio)
-{
- struct blkcg_gq *blkg;
-
- blkg = bio->bi_blkg;
- while (blkg && blkg->parent) {
- struct rq_wait *rqw;
- struct iolatency_grp *iolat;
-
- iolat = blkg_to_lat(blkg);
- if (!iolat)
- goto next;
-
- rqw = &iolat->rq_wait;
- atomic_dec(&rqw->inflight);
- wake_up(&rqw->wait);
-next:
- blkg = blkg->parent;
- }
-}
-
static void blkcg_iolatency_exit(struct rq_qos *rqos)
{
struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
static struct rq_qos_ops blkcg_iolatency_ops = {
.throttle = blkcg_iolatency_throttle,
- .cleanup = blkcg_iolatency_cleanup,
.done_bio = blkcg_iolatency_done_bio,
.exit = blkcg_iolatency_exit,
};
if (!oldval && val)
return 1;
- if (oldval && !val)
+ if (oldval && !val) {
+ blkcg_clear_delay(blkg);
return -1;
+ }
return 0;
}
int blk_rq_append_bio(struct request *rq, struct bio **bio)
{
struct bio *orig_bio = *bio;
+ struct bvec_iter iter;
+ struct bio_vec bv;
+ unsigned int nr_segs = 0;
blk_queue_bounce(rq->q, bio);
+ bio_for_each_bvec(bv, *bio, iter)
+ nr_segs++;
+
if (!rq->bio) {
- blk_rq_bio_prep(rq->q, rq, *bio);
+ blk_rq_bio_prep(rq, *bio, nr_segs);
} else {
- if (!ll_back_merge_fn(rq->q, rq, *bio)) {
+ if (!ll_back_merge_fn(rq, *bio, nr_segs)) {
if (orig_bio != *bio) {
bio_put(*bio);
*bio = orig_bio;
static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
struct bio *bio, struct bio_set *bs, unsigned *nsegs)
{
- *nsegs = 1;
+ *nsegs = 0;
if (!q->limits.max_write_zeroes_sectors)
return NULL;
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
unsigned nsegs = 0, sectors = 0;
- bool do_split = true;
- struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
const unsigned max_segs = queue_max_segments(q);
}
}
- do_split = false;
+ *segs = nsegs;
+ return NULL;
split:
*segs = nsegs;
-
- if (do_split) {
- new = bio_split(bio, sectors, GFP_NOIO, bs);
- if (new)
- bio = new;
- }
-
- return do_split ? new : NULL;
+ return bio_split(bio, sectors, GFP_NOIO, bs);
}
-void blk_queue_split(struct request_queue *q, struct bio **bio)
+void __blk_queue_split(struct request_queue *q, struct bio **bio,
+ unsigned int *nr_segs)
{
- struct bio *split, *res;
- unsigned nsegs;
+ struct bio *split;
switch (bio_op(*bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
- split = blk_bio_discard_split(q, *bio, &q->bio_split, &nsegs);
+ split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
break;
case REQ_OP_WRITE_ZEROES:
- split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split, &nsegs);
+ split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
+ nr_segs);
break;
case REQ_OP_WRITE_SAME:
- split = blk_bio_write_same_split(q, *bio, &q->bio_split, &nsegs);
+ split = blk_bio_write_same_split(q, *bio, &q->bio_split,
+ nr_segs);
break;
default:
- split = blk_bio_segment_split(q, *bio, &q->bio_split, &nsegs);
+ split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
break;
}
- /* physical segments can be figured out during splitting */
- res = split ? split : *bio;
- res->bi_phys_segments = nsegs;
- bio_set_flag(res, BIO_SEG_VALID);
-
if (split) {
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;
*bio = split;
}
}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio)
+{
+ unsigned int nr_segs;
+
+ __blk_queue_split(q, bio, &nr_segs);
+}
EXPORT_SYMBOL(blk_queue_split);
-static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
- struct bio *bio)
+unsigned int blk_recalc_rq_segments(struct request *rq)
{
unsigned int nr_phys_segs = 0;
- struct bvec_iter iter;
+ struct req_iterator iter;
struct bio_vec bv;
- if (!bio)
+ if (!rq->bio)
return 0;
- switch (bio_op(bio)) {
+ switch (bio_op(rq->bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
return 1;
}
- for_each_bio(bio) {
- bio_for_each_bvec(bv, bio, iter)
- bvec_split_segs(q, &bv, &nr_phys_segs, NULL, UINT_MAX);
- }
-
+ rq_for_each_bvec(bv, rq, iter)
+ bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX);
return nr_phys_segs;
}
-void blk_recalc_rq_segments(struct request *rq)
-{
- rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
-}
-
-void blk_recount_segments(struct request_queue *q, struct bio *bio)
-{
- struct bio *nxt = bio->bi_next;
-
- bio->bi_next = NULL;
- bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
- bio->bi_next = nxt;
-
- bio_set_flag(bio, BIO_SEG_VALID);
-}
-
static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
struct scatterlist *sglist)
{
}
EXPORT_SYMBOL(blk_rq_map_sg);
-static inline int ll_new_hw_segment(struct request_queue *q,
- struct request *req,
- struct bio *bio)
+static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
+ unsigned int nr_phys_segs)
{
- int nr_phys_segs = bio_phys_segments(q, bio);
-
- if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
+ if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(req->q))
goto no_merge;
- if (blk_integrity_merge_bio(q, req, bio) == false)
+ if (blk_integrity_merge_bio(req->q, req, bio) == false)
goto no_merge;
/*
return 1;
no_merge:
- req_set_nomerge(q, req);
+ req_set_nomerge(req->q, req);
return 0;
}
-int ll_back_merge_fn(struct request_queue *q, struct request *req,
- struct bio *bio)
+int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
{
if (req_gap_back_merge(req, bio))
return 0;
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
- req_set_nomerge(q, req);
+ req_set_nomerge(req->q, req);
return 0;
}
- if (!bio_flagged(req->biotail, BIO_SEG_VALID))
- blk_recount_segments(q, req->biotail);
- if (!bio_flagged(bio, BIO_SEG_VALID))
- blk_recount_segments(q, bio);
- return ll_new_hw_segment(q, req, bio);
+ return ll_new_hw_segment(req, bio, nr_segs);
}
-int ll_front_merge_fn(struct request_queue *q, struct request *req,
- struct bio *bio)
+int ll_front_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
{
-
if (req_gap_front_merge(req, bio))
return 0;
if (blk_integrity_rq(req) &&
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) {
- req_set_nomerge(q, req);
+ req_set_nomerge(req->q, req);
return 0;
}
- if (!bio_flagged(bio, BIO_SEG_VALID))
- blk_recount_segments(q, bio);
- if (!bio_flagged(req->bio, BIO_SEG_VALID))
- blk_recount_segments(q, req->bio);
- return ll_new_hw_segment(q, req, bio);
+ return ll_new_hw_segment(req, bio, nr_segs);
}
static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
{
if (stat->nr_samples) {
- seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu",
+ seq_printf(m, "samples=%d, mean=%llu, min=%llu, max=%llu",
stat->nr_samples, stat->mean, stat->min, stat->max);
} else {
seq_puts(m, "samples=0");
struct request_queue *q = data;
int bucket;
- for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS/2; bucket++) {
- seq_printf(m, "read (%d Bytes): ", 1 << (9+bucket));
- print_stat(m, &q->poll_stat[2*bucket]);
+ for (bucket = 0; bucket < (BLK_MQ_POLL_STATS_BKTS / 2); bucket++) {
+ seq_printf(m, "read (%d Bytes): ", 1 << (9 + bucket));
+ print_stat(m, &q->poll_stat[2 * bucket]);
seq_puts(m, "\n");
- seq_printf(m, "write (%d Bytes): ", 1 << (9+bucket));
- print_stat(m, &q->poll_stat[2*bucket+1]);
+ seq_printf(m, "write (%d Bytes): ", 1 << (9 + bucket));
+ print_stat(m, &q->poll_stat[2 * bucket + 1]);
seq_puts(m, "\n");
}
return 0;
return 0;
}
-#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
-static const char *const op_name[] = {
- REQ_OP_NAME(READ),
- REQ_OP_NAME(WRITE),
- REQ_OP_NAME(FLUSH),
- REQ_OP_NAME(DISCARD),
- REQ_OP_NAME(SECURE_ERASE),
- REQ_OP_NAME(ZONE_RESET),
- REQ_OP_NAME(WRITE_SAME),
- REQ_OP_NAME(WRITE_ZEROES),
- REQ_OP_NAME(SCSI_IN),
- REQ_OP_NAME(SCSI_OUT),
- REQ_OP_NAME(DRV_IN),
- REQ_OP_NAME(DRV_OUT),
-};
-#undef REQ_OP_NAME
-
#define CMD_FLAG_NAME(name) [__REQ_##name] = #name
static const char *const cmd_flag_name[] = {
CMD_FLAG_NAME(FAILFAST_DEV),
int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
{
const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
- const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
+ const unsigned int op = req_op(rq);
+ const char *op_str = blk_op_str(op);
seq_printf(m, "%p {.op=", rq);
- if (op < ARRAY_SIZE(op_name) && op_name[op])
- seq_printf(m, "%s", op_name[op]);
+ if (strcmp(op_str, "UNKNOWN") == 0)
+ seq_printf(m, "%u", op);
else
- seq_printf(m, "%d", op);
+ seq_printf(m, "%s", op_str);
seq_puts(m, ", .cmd_flags=");
blk_flags_show(m, rq->cmd_flags & ~REQ_OP_MASK, cmd_flag_name,
ARRAY_SIZE(cmd_flag_name));
if (attr->show)
return single_release(inode, file);
- else
- return seq_release(inode, file);
+
+ return seq_release(inode, file);
}
static const struct file_operations blk_mq_debugfs_fops = {
}
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
- struct request **merged_request)
+ unsigned int nr_segs, struct request **merged_request)
{
struct request *rq;
case ELEVATOR_BACK_MERGE:
if (!blk_mq_sched_allow_merge(q, rq, bio))
return false;
- if (!bio_attempt_back_merge(q, rq, bio))
+ if (!bio_attempt_back_merge(rq, bio, nr_segs))
return false;
*merged_request = attempt_back_merge(q, rq);
if (!*merged_request)
case ELEVATOR_FRONT_MERGE:
if (!blk_mq_sched_allow_merge(q, rq, bio))
return false;
- if (!bio_attempt_front_merge(q, rq, bio))
+ if (!bio_attempt_front_merge(rq, bio, nr_segs))
return false;
*merged_request = attempt_front_merge(q, rq);
if (!*merged_request)
* of them.
*/
bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
- struct bio *bio)
+ struct bio *bio, unsigned int nr_segs)
{
struct request *rq;
int checked = 8;
switch (blk_try_merge(rq, bio)) {
case ELEVATOR_BACK_MERGE:
if (blk_mq_sched_allow_merge(q, rq, bio))
- merged = bio_attempt_back_merge(q, rq, bio);
+ merged = bio_attempt_back_merge(rq, bio,
+ nr_segs);
break;
case ELEVATOR_FRONT_MERGE:
if (blk_mq_sched_allow_merge(q, rq, bio))
- merged = bio_attempt_front_merge(q, rq, bio);
+ merged = bio_attempt_front_merge(rq, bio,
+ nr_segs);
break;
case ELEVATOR_DISCARD_MERGE:
merged = bio_attempt_discard_merge(q, rq, bio);
*/
static bool blk_mq_attempt_merge(struct request_queue *q,
struct blk_mq_hw_ctx *hctx,
- struct blk_mq_ctx *ctx, struct bio *bio)
+ struct blk_mq_ctx *ctx, struct bio *bio,
+ unsigned int nr_segs)
{
enum hctx_type type = hctx->type;
lockdep_assert_held(&ctx->lock);
- if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio)) {
+ if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) {
ctx->rq_merged++;
return true;
}
return false;
}
-bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs)
{
struct elevator_queue *e = q->elevator;
struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
bool ret = false;
enum hctx_type type;
- if (e && e->type->ops.bio_merge) {
- blk_mq_put_ctx(ctx);
- return e->type->ops.bio_merge(hctx, bio);
- }
+ if (e && e->type->ops.bio_merge)
+ return e->type->ops.bio_merge(hctx, bio, nr_segs);
type = hctx->type;
if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
!list_empty_careful(&ctx->rq_lists[type])) {
/* default per sw-queue merge */
spin_lock(&ctx->lock);
- ret = blk_mq_attempt_merge(q, hctx, ctx, bio);
+ ret = blk_mq_attempt_merge(q, hctx, ctx, bio, nr_segs);
spin_unlock(&ctx->lock);
}
- blk_mq_put_ctx(ctx);
return ret;
}
void blk_mq_sched_request_inserted(struct request *rq);
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
- struct request **merged_request);
-bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+ unsigned int nr_segs, struct request **merged_request);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs);
bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_free_requests(struct request_queue *q);
static inline bool
-blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+ unsigned int nr_segs)
{
if (blk_queue_nomerges(q) || !bio_mergeable(bio))
return false;
- return __blk_mq_sched_bio_merge(q, bio);
+ return __blk_mq_sched_bio_merge(q, bio, nr_segs);
}
static inline bool
struct sbq_wait_state *ws;
DEFINE_SBQ_WAIT(wait);
unsigned int tag_offset;
- bool drop_ctx;
int tag;
if (data->flags & BLK_MQ_REQ_RESERVED) {
return BLK_MQ_TAG_FAIL;
ws = bt_wait_ptr(bt, data->hctx);
- drop_ctx = data->ctx == NULL;
do {
struct sbitmap_queue *bt_prev;
if (tag != -1)
break;
- if (data->ctx)
- blk_mq_put_ctx(data->ctx);
-
bt_prev = bt;
io_schedule();
ws = bt_wait_ptr(bt, data->hctx);
} while (1);
- if (drop_ctx && data->ctx)
- blk_mq_put_ctx(data->ctx);
-
sbitmap_finish_wait(bt, ws, &wait);
found_tag:
struct elevator_queue *e = q->elevator;
struct request *rq;
unsigned int tag;
- bool put_ctx_on_error = false;
+ bool clear_ctx_on_error = false;
blk_queue_enter_live(q);
data->q = q;
if (likely(!data->ctx)) {
data->ctx = blk_mq_get_ctx(q);
- put_ctx_on_error = true;
+ clear_ctx_on_error = true;
}
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->cmd_flags,
tag = blk_mq_get_tag(data);
if (tag == BLK_MQ_TAG_FAIL) {
- if (put_ctx_on_error) {
- blk_mq_put_ctx(data->ctx);
+ if (clear_ctx_on_error)
data->ctx = NULL;
- }
blk_queue_exit(q);
return NULL;
}
if (!rq)
return ERR_PTR(-EWOULDBLOCK);
- blk_mq_put_ctx(alloc_data.ctx);
-
rq->__data_len = 0;
rq->__sector = (sector_t) -1;
rq->bio = rq->biotail = NULL;
}
}
-static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
+static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
+ unsigned int nr_segs)
{
- blk_init_request_from_bio(rq, bio);
+ if (bio->bi_opf & REQ_RAHEAD)
+ rq->cmd_flags |= REQ_FAILFAST_MASK;
+
+ rq->__sector = bio->bi_iter.bi_sector;
+ rq->write_hint = bio->bi_write_hint;
+ blk_rq_bio_prep(rq, bio, nr_segs);
blk_account_io_start(rq, true);
}
struct request *rq;
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
+ unsigned int nr_segs;
blk_qc_t cookie;
blk_queue_bounce(q, &bio);
-
- blk_queue_split(q, &bio);
+ __blk_queue_split(q, &bio, &nr_segs);
if (!bio_integrity_prep(bio))
return BLK_QC_T_NONE;
if (!is_flush_fua && !blk_queue_nomerges(q) &&
- blk_attempt_plug_merge(q, bio, &same_queue_rq))
+ blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq))
return BLK_QC_T_NONE;
- if (blk_mq_sched_bio_merge(q, bio))
+ if (blk_mq_sched_bio_merge(q, bio, nr_segs))
return BLK_QC_T_NONE;
rq_qos_throttle(q, bio);
cookie = request_to_qc_t(data.hctx, rq);
+ blk_mq_bio_to_request(rq, bio, nr_segs);
+
plug = current->plug;
if (unlikely(is_flush_fua)) {
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
-
/* bypass scheduler for flush rq */
blk_insert_flush(rq);
blk_mq_run_hw_queue(data.hctx, true);
unsigned int request_count = plug->rq_count;
struct request *last = NULL;
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
-
if (!request_count)
trace_block_plug(q);
else
blk_add_rq_to_plug(plug, rq);
} else if (plug && !blk_queue_nomerges(q)) {
- blk_mq_bio_to_request(rq, bio);
-
/*
* We do limited plugging. If the bio can be merged, do that.
* Otherwise the existing request in the plug list will be
blk_add_rq_to_plug(plug, rq);
trace_block_plug(q);
- blk_mq_put_ctx(data.ctx);
-
if (same_queue_rq) {
data.hctx = same_queue_rq->mq_hctx;
trace_block_unplug(q, 1, true);
}
} else if ((q->nr_hw_queues > 1 && is_sync) || (!q->elevator &&
!data.hctx->dispatch_busy)) {
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
blk_mq_try_issue_directly(data.hctx, rq, &cookie);
} else {
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
blk_mq_sched_insert_request(rq, false, true, true);
}
*/
static inline struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q)
{
- return __blk_mq_get_ctx(q, get_cpu());
-}
-
-static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
-{
- put_cpu();
+ return __blk_mq_get_ctx(q, raw_smp_processor_id());
}
struct blk_mq_alloc_data {
int node, int cmd_size, gfp_t flags);
void blk_free_flush_queue(struct blk_flush_queue *q);
-void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
- struct bio *bio);
void blk_freeze_queue(struct request_queue *q);
static inline void blk_queue_enter_live(struct request_queue *q)
return __bvec_gap_to_prev(q, bprv, offset);
}
+static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
+ unsigned int nr_segs)
+{
+ rq->nr_phys_segments = nr_segs;
+ rq->__data_len = bio->bi_iter.bi_size;
+ rq->bio = rq->biotail = bio;
+ rq->ioprio = bio_prio(bio);
+
+ if (bio->bi_disk)
+ rq->rq_disk = bio->bi_disk;
+}
+
#ifdef CONFIG_BLK_DEV_INTEGRITY
void blk_flush_integrity(void);
bool __bio_integrity_endio(struct bio *);
unsigned long blk_rq_timeout(unsigned long timeout);
void blk_add_timer(struct request *req);
-bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
- struct bio *bio);
-bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
- struct bio *bio);
+bool bio_attempt_front_merge(struct request *req, struct bio *bio,
+ unsigned int nr_segs);
+bool bio_attempt_back_merge(struct request *req, struct bio *bio,
+ unsigned int nr_segs);
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
struct bio *bio);
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
- struct request **same_queue_rq);
+ unsigned int nr_segs, struct request **same_queue_rq);
void blk_account_io_start(struct request *req, bool new_io);
void blk_account_io_completion(struct request *req, unsigned int bytes);
}
#endif
-int ll_back_merge_fn(struct request_queue *q, struct request *req,
- struct bio *bio);
-int ll_front_merge_fn(struct request_queue *q, struct request *req,
- struct bio *bio);
+void __blk_queue_split(struct request_queue *q, struct bio **bio,
+ unsigned int *nr_segs);
+int ll_back_merge_fn(struct request *req, struct bio *bio,
+ unsigned int nr_segs);
+int ll_front_merge_fn(struct request *req, struct bio *bio,
+ unsigned int nr_segs);
struct request *attempt_back_merge(struct request_queue *q, struct request *rq);
struct request *attempt_front_merge(struct request_queue *q, struct request *rq);
int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
struct request *next);
-void blk_recalc_rq_segments(struct request *rq);
+unsigned int blk_recalc_rq_segments(struct request *rq);
void blk_rq_set_mixed_merge(struct request *rq);
bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
struct disk_part_tbl *new_ptbl;
int len = old_ptbl ? old_ptbl->len : 0;
int i, target;
- size_t size;
/*
* check for int overflow, since we can get here from blkpg_ioctl()
if (target <= len)
return 0;
- size = sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0]);
- new_ptbl = kzalloc_node(size, GFP_KERNEL, disk->node_id);
+ new_ptbl = kzalloc_node(struct_size(new_ptbl, part, target), GFP_KERNEL,
+ disk->node_id);
if (!new_ptbl)
return -ENOMEM;
}
}
-static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ unsigned int nr_segs)
{
struct kyber_hctx_data *khd = hctx->sched_data;
struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
bool merged;
spin_lock(&kcq->lock);
- merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio);
+ merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
spin_unlock(&kcq->lock);
- blk_mq_put_ctx(ctx);
return merged;
}
return ELEVATOR_NO_MERGE;
}
-static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ unsigned int nr_segs)
{
struct request_queue *q = hctx->queue;
struct deadline_data *dd = q->elevator->elevator_data;
bool ret;
spin_lock(&dd->lock);
- ret = blk_mq_sched_try_merge(q, bio, &free);
+ ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
spin_unlock(&dd->lock);
if (free)
OPAL_ENTERPRISE_BANDMASTER0_UID,
OPAL_ENTERPRISE_ERASEMASTER_UID,
/* tables */
+ OPAL_TABLE_TABLE,
OPAL_LOCKINGRANGE_GLOBAL,
OPAL_LOCKINGRANGE_ACE_RDLOCKED,
OPAL_LOCKINGRANGE_ACE_WRLOCKED,
OPAL_STARTCOLUMN = 0x03,
OPAL_ENDCOLUMN = 0x04,
OPAL_VALUES = 0x01,
+ /* table table */
+ OPAL_TABLE_UID = 0x00,
+ OPAL_TABLE_NAME = 0x01,
+ OPAL_TABLE_COMMON = 0x02,
+ OPAL_TABLE_TEMPLATE = 0x03,
+ OPAL_TABLE_KIND = 0x04,
+ OPAL_TABLE_COLUMN = 0x05,
+ OPAL_TABLE_COLUMNS = 0x06,
+ OPAL_TABLE_ROWS = 0x07,
+ OPAL_TABLE_ROWS_FREE = 0x08,
+ OPAL_TABLE_ROW_BYTES = 0x09,
+ OPAL_TABLE_LASTID = 0x0A,
+ OPAL_TABLE_MIN = 0x0B,
+ OPAL_TABLE_MAX = 0x0C,
+
/* authority table */
OPAL_PIN = 0x03,
/* locking tokens */
#define IO_BUFFER_LENGTH 2048
#define MAX_TOKS 64
+/* Number of bytes needed by cmd_finalize. */
+#define CMD_FINALIZE_BYTES_NEEDED 7
+
struct opal_step {
int (*fn)(struct opal_dev *dev, void *data);
void *data;
/* tables */
+ [OPAL_TABLE_TABLE]
+ { 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01 },
[OPAL_LOCKINGRANGE_GLOBAL] =
{ 0x00, 0x00, 0x08, 0x02, 0x00, 0x00, 0x00, 0x01 },
[OPAL_LOCKINGRANGE_ACE_RDLOCKED] =
return execute_step(dev, &discovery0_step, 0);
}
+static size_t remaining_size(struct opal_dev *cmd)
+{
+ return IO_BUFFER_LENGTH - cmd->pos;
+}
+
static bool can_add(int *err, struct opal_dev *cmd, size_t len)
{
if (*err)
return false;
- if (len > IO_BUFFER_LENGTH || cmd->pos > IO_BUFFER_LENGTH - len) {
+ if (remaining_size(cmd) < len) {
pr_debug("Error adding %zu bytes: end of buffer.\n", len);
*err = -ERANGE;
return false;
struct opal_header *hdr;
int err = 0;
- /* close the parameter list opened from cmd_start */
+ /*
+ * Close the parameter list opened from cmd_start.
+ * The number of bytes added must be equal to
+ * CMD_FINALIZE_BYTES_NEEDED.
+ */
add_token_u8(&err, cmd, OPAL_ENDLIST);
add_token_u8(&err, cmd, OPAL_ENDOFDATA);
return finalize_and_send(dev, parse_and_check_status);
}
+/*
+ * see TCG SAS 5.3.2.3 for a description of the available columns
+ *
+ * the result is provided in dev->resp->tok[4]
+ */
+static int generic_get_table_info(struct opal_dev *dev, enum opal_uid table,
+ u64 column)
+{
+ u8 uid[OPAL_UID_LENGTH];
+ const unsigned int half = OPAL_UID_LENGTH/2;
+
+ /* sed-opal UIDs can be split in two halves:
+ * first: actual table index
+ * second: relative index in the table
+ * so we have to get the first half of the OPAL_TABLE_TABLE and use the
+ * first part of the target table as relative index into that table
+ */
+ memcpy(uid, opaluid[OPAL_TABLE_TABLE], half);
+ memcpy(uid+half, opaluid[table], half);
+
+ return generic_get_column(dev, uid, column);
+}
+
static int gen_key(struct opal_dev *dev, void *data)
{
u8 uid[OPAL_UID_LENGTH];
break;
case OPAL_ADMIN1_UID:
case OPAL_SID_UID:
+ case OPAL_PSID_UID:
add_token_u8(&err, dev, OPAL_STARTNAME);
add_token_u8(&err, dev, 0); /* HostChallenge */
add_token_bytestring(&err, dev, key, key_len);
key->key, key->key_len);
}
+static int start_PSID_opal_session(struct opal_dev *dev, void *data)
+{
+ const struct opal_key *okey = data;
+
+ return start_generic_opal_session(dev, OPAL_PSID_UID,
+ OPAL_ADMINSP_UID,
+ okey->key,
+ okey->key_len);
+}
+
static int start_auth_opal_session(struct opal_dev *dev, void *data)
{
struct opal_session_info *session = data;
return finalize_and_send(dev, parse_and_check_status);
}
+static int write_shadow_mbr(struct opal_dev *dev, void *data)
+{
+ struct opal_shadow_mbr *shadow = data;
+ const u8 __user *src;
+ u8 *dst;
+ size_t off = 0;
+ u64 len;
+ int err = 0;
+
+ /* do we fit in the available shadow mbr space? */
+ err = generic_get_table_info(dev, OPAL_MBR, OPAL_TABLE_ROWS);
+ if (err) {
+ pr_debug("MBR: could not get shadow size\n");
+ return err;
+ }
+
+ len = response_get_u64(&dev->parsed, 4);
+ if (shadow->size > len || shadow->offset > len - shadow->size) {
+ pr_debug("MBR: does not fit in shadow (%llu vs. %llu)\n",
+ shadow->offset + shadow->size, len);
+ return -ENOSPC;
+ }
+
+ /* do the actual transmission(s) */
+ src = (u8 __user *)(uintptr_t)shadow->data;
+ while (off < shadow->size) {
+ err = cmd_start(dev, opaluid[OPAL_MBR], opalmethod[OPAL_SET]);
+ add_token_u8(&err, dev, OPAL_STARTNAME);
+ add_token_u8(&err, dev, OPAL_WHERE);
+ add_token_u64(&err, dev, shadow->offset + off);
+ add_token_u8(&err, dev, OPAL_ENDNAME);
+
+ add_token_u8(&err, dev, OPAL_STARTNAME);
+ add_token_u8(&err, dev, OPAL_VALUES);
+
+ /*
+ * The bytestring header is either 1 or 2 bytes, so assume 2.
+ * There also needs to be enough space to accommodate the
+ * trailing OPAL_ENDNAME (1 byte) and tokens added by
+ * cmd_finalize.
+ */
+ len = min(remaining_size(dev) - (2+1+CMD_FINALIZE_BYTES_NEEDED),
+ (size_t)(shadow->size - off));
+ pr_debug("MBR: write bytes %zu+%llu/%llu\n",
+ off, len, shadow->size);
+
+ dst = add_bytestring_header(&err, dev, len);
+ if (!dst)
+ break;
+ if (copy_from_user(dst, src + off, len))
+ err = -EFAULT;
+ dev->pos += len;
+
+ add_token_u8(&err, dev, OPAL_ENDNAME);
+ if (err)
+ break;
+
+ err = finalize_and_send(dev, parse_and_check_status);
+ if (err)
+ break;
+
+ off += len;
+ }
+ return err;
+}
+
static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid,
struct opal_dev *dev)
{
return ret;
}
+static int opal_set_mbr_done(struct opal_dev *dev,
+ struct opal_mbr_done *mbr_done)
+{
+ u8 mbr_done_tf = mbr_done->done_flag == OPAL_MBR_DONE ?
+ OPAL_TRUE : OPAL_FALSE;
+
+ const struct opal_step mbr_steps[] = {
+ { start_admin1LSP_opal_session, &mbr_done->key },
+ { set_mbr_done, &mbr_done_tf },
+ { end_opal_session, }
+ };
+ int ret;
+
+ if (mbr_done->done_flag != OPAL_MBR_DONE &&
+ mbr_done->done_flag != OPAL_MBR_NOT_DONE)
+ return -EINVAL;
+
+ mutex_lock(&dev->dev_lock);
+ setup_opal_dev(dev);
+ ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps));
+ mutex_unlock(&dev->dev_lock);
+ return ret;
+}
+
+static int opal_write_shadow_mbr(struct opal_dev *dev,
+ struct opal_shadow_mbr *info)
+{
+ const struct opal_step mbr_steps[] = {
+ { start_admin1LSP_opal_session, &info->key },
+ { write_shadow_mbr, info },
+ { end_opal_session, }
+ };
+ int ret;
+
+ if (info->size == 0)
+ return 0;
+
+ mutex_lock(&dev->dev_lock);
+ setup_opal_dev(dev);
+ ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps));
+ mutex_unlock(&dev->dev_lock);
+ return ret;
+}
+
static int opal_save(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk)
{
struct opal_suspend_data *suspend;
return ret;
}
-static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal)
+static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal, bool psid)
{
+ /* controller will terminate session */
const struct opal_step revert_steps[] = {
{ start_SIDASP_opal_session, opal },
- { revert_tper, } /* controller will terminate session */
+ { revert_tper, }
+ };
+ const struct opal_step psid_revert_steps[] = {
+ { start_PSID_opal_session, opal },
+ { revert_tper, }
};
+
int ret;
mutex_lock(&dev->dev_lock);
setup_opal_dev(dev);
- ret = execute_steps(dev, revert_steps, ARRAY_SIZE(revert_steps));
+ if (psid)
+ ret = execute_steps(dev, psid_revert_steps,
+ ARRAY_SIZE(psid_revert_steps));
+ else
+ ret = execute_steps(dev, revert_steps,
+ ARRAY_SIZE(revert_steps));
mutex_unlock(&dev->dev_lock);
/*
{
int ret;
- if (lk_unlk->session.who < OPAL_ADMIN1 ||
- lk_unlk->session.who > OPAL_USER9)
+ if (lk_unlk->session.who > OPAL_USER9)
return -EINVAL;
mutex_lock(&dev->dev_lock);
};
int ret;
- if (opal_pw->session.who < OPAL_ADMIN1 ||
- opal_pw->session.who > OPAL_USER9 ||
- opal_pw->new_user_pw.who < OPAL_ADMIN1 ||
+ if (opal_pw->session.who > OPAL_USER9 ||
opal_pw->new_user_pw.who > OPAL_USER9)
return -EINVAL;
ret = opal_activate_user(dev, p);
break;
case IOC_OPAL_REVERT_TPR:
- ret = opal_reverttper(dev, p);
+ ret = opal_reverttper(dev, p, false);
break;
case IOC_OPAL_LR_SETUP:
ret = opal_setup_locking_range(dev, p);
case IOC_OPAL_ENABLE_DISABLE_MBR:
ret = opal_enable_disable_shadow_mbr(dev, p);
break;
+ case IOC_OPAL_MBR_DONE:
+ ret = opal_set_mbr_done(dev, p);
+ break;
+ case IOC_OPAL_WRITE_SHADOW_MBR:
+ ret = opal_write_shadow_mbr(dev, p);
+ break;
case IOC_OPAL_ERASE_LR:
ret = opal_erase_locking_range(dev, p);
break;
case IOC_OPAL_SECURE_ERASE_LR:
ret = opal_secure_erase_locking_range(dev, p);
break;
+ case IOC_OPAL_PSID_REVERT_TPR:
+ ret = opal_reverttper(dev, p, true);
+ break;
default:
break;
}
void drbd_debugfs_resource_add(struct drbd_resource *resource)
{
struct dentry *dentry;
- if (!drbd_debugfs_resources)
- return;
dentry = debugfs_create_dir(resource->name, drbd_debugfs_resources);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
resource->debugfs_res = dentry;
dentry = debugfs_create_dir("volumes", resource->debugfs_res);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
resource->debugfs_res_volumes = dentry;
dentry = debugfs_create_dir("connections", resource->debugfs_res);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
resource->debugfs_res_connections = dentry;
dentry = debugfs_create_file("in_flight_summary", 0440,
resource->debugfs_res, resource,
&in_flight_summary_fops);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
resource->debugfs_res_in_flight_summary = dentry;
- return;
-
-fail:
- drbd_debugfs_resource_cleanup(resource);
- drbd_err(resource, "failed to create debugfs dentry\n");
}
static void drbd_debugfs_remove(struct dentry **dp)
{
struct dentry *conns_dir = connection->resource->debugfs_res_connections;
struct dentry *dentry;
- if (!conns_dir)
- return;
/* Once we enable mutliple peers,
* these connections will have descriptive names.
* For now, it is just the one connection to the (only) "peer". */
dentry = debugfs_create_dir("peer", conns_dir);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
connection->debugfs_conn = dentry;
dentry = debugfs_create_file("callback_history", 0440,
connection->debugfs_conn, connection,
&connection_callback_history_fops);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
connection->debugfs_conn_callback_history = dentry;
dentry = debugfs_create_file("oldest_requests", 0440,
connection->debugfs_conn, connection,
&connection_oldest_requests_fops);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
connection->debugfs_conn_oldest_requests = dentry;
- return;
-
-fail:
- drbd_debugfs_connection_cleanup(connection);
- drbd_err(connection, "failed to create debugfs dentry\n");
}
void drbd_debugfs_connection_cleanup(struct drbd_connection *connection)
snprintf(vnr_buf, sizeof(vnr_buf), "%u", device->vnr);
dentry = debugfs_create_dir(vnr_buf, vols_dir);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
device->debugfs_vol = dentry;
snprintf(minor_buf, sizeof(minor_buf), "%u", device->minor);
if (!slink_name)
goto fail;
dentry = debugfs_create_symlink(minor_buf, drbd_debugfs_minors, slink_name);
+ device->debugfs_minor = dentry;
kfree(slink_name);
slink_name = NULL;
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
- device->debugfs_minor = dentry;
#define DCF(name) do { \
dentry = debugfs_create_file(#name, 0440, \
device->debugfs_vol, device, \
&device_ ## name ## _fops); \
- if (IS_ERR_OR_NULL(dentry)) \
- goto fail; \
device->debugfs_vol_ ## name = dentry; \
} while (0)
struct dentry *dentry;
char vnr_buf[8];
- if (!conn_dir)
- return;
-
snprintf(vnr_buf, sizeof(vnr_buf), "%u", peer_device->device->vnr);
dentry = debugfs_create_dir(vnr_buf, conn_dir);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
peer_device->debugfs_peer_dev = dentry;
- return;
-
-fail:
- drbd_debugfs_peer_device_cleanup(peer_device);
- drbd_err(peer_device, "failed to create debugfs entries\n");
}
void drbd_debugfs_peer_device_cleanup(struct drbd_peer_device *peer_device)
drbd_debugfs_remove(&drbd_debugfs_root);
}
-int __init drbd_debugfs_init(void)
+void __init drbd_debugfs_init(void)
{
struct dentry *dentry;
dentry = debugfs_create_dir("drbd", NULL);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
drbd_debugfs_root = dentry;
dentry = debugfs_create_file("version", 0444, drbd_debugfs_root, NULL, &drbd_version_fops);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
drbd_debugfs_version = dentry;
dentry = debugfs_create_dir("resources", drbd_debugfs_root);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
drbd_debugfs_resources = dentry;
dentry = debugfs_create_dir("minors", drbd_debugfs_root);
- if (IS_ERR_OR_NULL(dentry))
- goto fail;
drbd_debugfs_minors = dentry;
- return 0;
-
-fail:
- drbd_debugfs_cleanup();
- if (dentry)
- return PTR_ERR(dentry);
- else
- return -EINVAL;
}
#include "drbd_int.h"
#ifdef CONFIG_DEBUG_FS
-int __init drbd_debugfs_init(void);
+void __init drbd_debugfs_init(void);
void drbd_debugfs_cleanup(void);
void drbd_debugfs_resource_add(struct drbd_resource *resource);
void drbd_debugfs_peer_device_cleanup(struct drbd_peer_device *peer_device);
#else
-static inline int __init drbd_debugfs_init(void) { return -ENODEV; }
+static inline void __init drbd_debugfs_init(void) { }
static inline void drbd_debugfs_cleanup(void) { }
static inline void drbd_debugfs_resource_add(struct drbd_resource *resource) { }
spin_lock_init(&retry.lock);
INIT_LIST_HEAD(&retry.writes);
- if (drbd_debugfs_init())
- pr_notice("failed to initialize debugfs -- will not be available\n");
+ drbd_debugfs_init();
pr_info("initialized. "
"Version: " REL_VERSION " (api:%d/proto:%d-%d)\n",
if (!UDP->cmos)
UDP->cmos = FLOPPY0_TYPE;
drive = 1;
- if (!UDP->cmos && FLOPPY1_TYPE)
+ if (!UDP->cmos)
UDP->cmos = FLOPPY1_TYPE;
/* FIXME: additional physical CMOS drive detection should go here */
return ret;
}
-static inline void loop_iov_iter_bvec(struct iov_iter *i,
- unsigned int direction, const struct bio_vec *bvec,
- unsigned long nr_segs, size_t count)
-{
- iov_iter_bvec(i, direction, bvec, nr_segs, count);
- i->type |= ITER_BVEC_FLAG_NO_REF;
-}
-
static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
{
struct iov_iter i;
ssize_t bw;
- loop_iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len);
+ iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len);
file_start_write(file);
bw = vfs_iter_write(file, &i, ppos, 0);
ssize_t len;
rq_for_each_segment(bvec, rq, iter) {
- loop_iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len);
+ iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len);
len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0);
if (len < 0)
return len;
b.bv_offset = 0;
b.bv_len = bvec.bv_len;
- loop_iov_iter_bvec(&i, READ, &b, 1, b.bv_len);
+ iov_iter_bvec(&i, READ, &b, 1, b.bv_len);
len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0);
if (len < 0) {
ret = len;
}
atomic_set(&cmd->ref, 2);
- loop_iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+ iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
cmd->iocb.ki_pos = pos;
ATA_SECT_SIZE * xfer_sz);
return -ENOMEM;
}
- memset(buf, 0, ATA_SECT_SIZE * xfer_sz);
}
/* Build the FIS. */
&port->block1_dma, GFP_KERNEL);
if (!port->block1)
return -ENOMEM;
- memset(port->block1, 0, BLOCK_DMA_ALLOC_SZ);
/* Allocate dma memory for command list */
port->command_list =
port->block1_dma = 0;
return -ENOMEM;
}
- memset(port->command_list, 0, AHCI_CMD_TBL_SZ);
/* Setup all pointers into first DMA region */
port->rxfis = port->block1 + AHCI_RX_FIS_OFFSET;
if (!cmd->command)
return -ENOMEM;
- memset(cmd->command, 0, CMD_DMA_ALLOC_SZ);
-
sg_init_table(cmd->sg, MTIP_MAX_SG);
return 0;
}
set_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags);
dev->power = newp;
} else if (dev->power && !newp) {
- mutex_lock(&lock);
- dev->power = newp;
- null_del_dev(dev->nullb);
- mutex_unlock(&lock);
- clear_bit(NULLB_DEV_FL_UP, &dev->flags);
+ if (test_and_clear_bit(NULLB_DEV_FL_UP, &dev->flags)) {
+ mutex_lock(&lock);
+ dev->power = newp;
+ null_del_dev(dev->nullb);
+ mutex_unlock(&lock);
+ }
clear_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags);
}
if (!cmd->error && dev->zoned) {
sector_t sector;
unsigned int nr_sectors;
- int op;
+ enum req_opf op;
if (dev->queue_mode == NULL_Q_BIO) {
op = bio_op(cmd->bio);
if (!nullb->queues)
return -ENOMEM;
- nullb->nr_queues = 0;
nullb->queue_depth = nullb->dev->hw_queue_depth;
return 0;
(FIT_QCMD_ALIGN - 1),
"not aligned: msg_buf %p mb_dma_address %pad\n",
skmsg->msg_buf, &skmsg->mb_dma_address);
- memset(skmsg->msg_buf, 0, SKD_N_FITMSG_BYTES);
}
err_out:
*/
static int nvm_remove_tgt(struct nvm_ioctl_remove *remove)
{
- struct nvm_target *t;
+ struct nvm_target *t = NULL;
struct nvm_dev *dev;
down_read(&nvm_lock);
void pblk_bio_free_pages(struct pblk *pblk, struct bio *bio, int off,
int nr_pages)
{
- struct bio_vec bv;
- int i;
-
- WARN_ON(off + nr_pages != bio->bi_vcnt);
-
- for (i = off; i < nr_pages + off; i++) {
- bv = bio->bi_io_vec[i];
- mempool_free(bv.bv_page, &pblk->page_bio_pool);
+ struct bio_vec *bv;
+ struct page *page;
+ int i, e, nbv = 0;
+
+ for (i = 0; i < bio->bi_vcnt; i++) {
+ bv = &bio->bi_io_vec[i];
+ page = bv->bv_page;
+ for (e = 0; e < bv->bv_len; e += PBLK_EXPOSED_PAGE_SIZE, nbv++)
+ if (nbv >= off)
+ mempool_free(page++, &pblk->page_bio_pool);
}
}
struct bucket *b;
long r;
+
+ /* No allocation if CACHE_SET_IO_DISABLE bit is set */
+ if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags)))
+ return -1;
+
/* fastpath */
if (fifo_pop(&ca->free[RESERVE_NONE], r) ||
fifo_pop(&ca->free[reserve], r))
{
int i;
+ /* No allocation if CACHE_SET_IO_DISABLE bit is set */
+ if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
+ return -1;
+
lockdep_assert_held(&c->bucket_lock);
BUG_ON(!n || n > c->caches_loaded || n > MAX_CACHES_PER_SET);
atomic_long_t writeback_keys_failed;
atomic_long_t reclaim;
+ atomic_long_t reclaimed_journal_buckets;
atomic_long_t flush_write;
- atomic_long_t retry_flush_write;
enum {
ON_ERROR_UNREGISTER,
#define BUCKET_HASH_BITS 12
struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS];
-
- DECLARE_HEAP(struct btree *, flush_btree);
};
struct bbio {
int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
uint8_t *set_uuid);
void bch_cached_dev_detach(struct cached_dev *dc);
-void bch_cached_dev_run(struct cached_dev *dc);
+int bch_cached_dev_run(struct cached_dev *dc);
void bcache_device_stop(struct bcache_device *d);
void bch_cache_set_unregister(struct cache_set *c);
void bch_btree_keys_init(struct btree_keys *b, const struct btree_keys_ops *ops,
bool *expensive_debug_checks)
{
- unsigned int i;
-
b->ops = ops;
b->expensive_debug_checks = expensive_debug_checks;
b->nsets = 0;
b->last_set_unwritten = 0;
- /* XXX: shouldn't be needed */
- for (i = 0; i < MAX_BSETS; i++)
- b->set[i].size = 0;
/*
- * Second loop starts at 1 because b->keys[0]->data is the memory we
- * allocated
+ * struct btree_keys in embedded in struct btree, and struct
+ * bset_tree is embedded into struct btree_keys. They are all
+ * initialized as 0 by kzalloc() in mca_bucket_alloc(), and
+ * b->set[0].data is allocated in bch_btree_keys_alloc(), so we
+ * don't have to initiate b->set[].size and b->set[].data here
+ * any more.
*/
- for (i = 1; i < MAX_BSETS; i++)
- b->set[i].data = NULL;
}
EXPORT_SYMBOL(bch_btree_keys_init);
unsigned int inorder, j, n = 1;
do {
- /*
- * A bit trick here.
- * If p < t->size, (int)(p - t->size) is a minus value and
- * the most significant bit is set, right shifting 31 bits
- * gets 1. If p >= t->size, the most significant bit is
- * not set, right shifting 31 bits gets 0.
- * So the following 2 lines equals to
- * if (p >= t->size)
- * p = 0;
- * but a branch instruction is avoided.
- */
unsigned int p = n << 4;
- p &= ((int) (p - t->size)) >> 31;
-
- prefetch(&t->tree[p]);
+ if (p < t->size)
+ prefetch(&t->tree[p]);
j = n;
f = &t->tree[j];
- /*
- * Similar bit trick, use subtract operation to avoid a branch
- * instruction.
- *
- * n = (f->mantissa > bfloat_mantissa())
- * ? j * 2
- * : j * 2 + 1;
- *
- * We need to subtract 1 from f->mantissa for the sign bit trick
- * to work - that's done in make_bfloat()
- */
- if (likely(f->exponent != 127))
- n = j * 2 + (((unsigned int)
- (f->mantissa -
- bfloat_mantissa(search, f))) >> 31);
- else
- n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
- ? j * 2
- : j * 2 + 1;
+ if (likely(f->exponent != 127)) {
+ if (f->mantissa >= bfloat_mantissa(search, f))
+ n = j * 2;
+ else
+ n = j * 2 + 1;
+ } else {
+ if (bkey_cmp(tree_to_bkey(t, j), search) > 0)
+ n = j * 2;
+ else
+ n = j * 2 + 1;
+ }
} while (n < t->size);
inorder = to_inorder(j, t);
#include <linux/rcupdate.h>
#include <linux/sched/clock.h>
#include <linux/rculist.h>
-
+#include <linux/delay.h>
#include <trace/events/bcache.h>
/*
static struct btree *mca_bucket_alloc(struct cache_set *c,
struct bkey *k, gfp_t gfp)
{
+ /*
+ * kzalloc() is necessary here for initialization,
+ * see code comments in bch_btree_keys_init().
+ */
struct btree *b = kzalloc(sizeof(struct btree), gfp);
if (!b)
up(&b->io_mutex);
}
+retry:
+ /*
+ * BTREE_NODE_dirty might be cleared in btree_flush_btree() by
+ * __bch_btree_node_write(). To avoid an extra flush, acquire
+ * b->write_lock before checking BTREE_NODE_dirty bit.
+ */
mutex_lock(&b->write_lock);
+ /*
+ * If this btree node is selected in btree_flush_write() by journal
+ * code, delay and retry until the node is flushed by journal code
+ * and BTREE_NODE_journal_flush bit cleared by btree_flush_write().
+ */
+ if (btree_node_journal_flush(b)) {
+ pr_debug("bnode %p is flushing by journal, retry", b);
+ mutex_unlock(&b->write_lock);
+ udelay(1);
+ goto retry;
+ }
+
if (btree_node_dirty(b))
__bch_btree_node_write(b, &cl);
mutex_unlock(&b->write_lock);
while (!list_empty(&c->btree_cache)) {
b = list_first_entry(&c->btree_cache, struct btree, list);
- if (btree_node_dirty(b))
+ /*
+ * This function is called by cache_set_free(), no I/O
+ * request on cache now, it is unnecessary to acquire
+ * b->write_lock before clearing BTREE_NODE_dirty anymore.
+ */
+ if (btree_node_dirty(b)) {
btree_complete_write(b, btree_current_write(b));
- clear_bit(BTREE_NODE_dirty, &b->flags);
-
+ clear_bit(BTREE_NODE_dirty, &b->flags);
+ }
mca_data_free(b);
}
BUG_ON(b == b->c->root);
+retry:
mutex_lock(&b->write_lock);
+ /*
+ * If the btree node is selected and flushing in btree_flush_write(),
+ * delay and retry until the BTREE_NODE_journal_flush bit cleared,
+ * then it is safe to free the btree node here. Otherwise this btree
+ * node will be in race condition.
+ */
+ if (btree_node_journal_flush(b)) {
+ mutex_unlock(&b->write_lock);
+ pr_debug("bnode %p journal_flush set, retry", b);
+ udelay(1);
+ goto retry;
+ }
- if (btree_node_dirty(b))
+ if (btree_node_dirty(b)) {
btree_complete_write(b, btree_current_write(b));
- clear_bit(BTREE_NODE_dirty, &b->flags);
+ clear_bit(BTREE_NODE_dirty, &b->flags);
+ }
mutex_unlock(&b->write_lock);
BTREE_NODE_io_error,
BTREE_NODE_dirty,
BTREE_NODE_write_idx,
+ BTREE_NODE_journal_flush,
};
BTREE_FLAG(io_error);
BTREE_FLAG(dirty);
BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
static inline struct btree_write *btree_current_write(struct btree *b)
{
WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
+ /*
+ * Read-ahead requests on a degrading and recovering md raid
+ * (e.g. raid6) device might be failured immediately by md
+ * raid code, which is not a real hardware media failure. So
+ * we shouldn't count failed REQ_RAHEAD bio to dc->io_errors.
+ */
+ if (bio->bi_opf & REQ_RAHEAD) {
+ pr_warn_ratelimited("%s: Read-ahead I/O failed on backing device, ignore",
+ dc->backing_dev_name);
+ return;
+ }
+
errors = atomic_add_return(1, &dc->io_errors);
if (errors < dc->error_limit)
pr_err("%s: IO error on backing device, unrecoverable",
blocks = set_blocks(j, block_bytes(ca->set));
+ /*
+ * Nodes in 'list' are in linear increasing order of
+ * i->j.seq, the node on head has the smallest (oldest)
+ * journal seq, the node on tail has the biggest
+ * (latest) journal seq.
+ */
+
+ /*
+ * Check from the oldest jset for last_seq. If
+ * i->j.seq < j->last_seq, it means the oldest jset
+ * in list is expired and useless, remove it from
+ * this list. Otherwise, j is a condidate jset for
+ * further following checks.
+ */
while (!list_empty(list)) {
i = list_first_entry(list,
struct journal_replay, list);
kfree(i);
}
+ /* iterate list in reverse order (from latest jset) */
list_for_each_entry_reverse(i, list, list) {
if (j->seq == i->j.seq)
goto next_set;
+ /*
+ * if j->seq is less than any i->j.last_seq
+ * in list, j is an expired and useless jset.
+ */
if (j->seq < i->j.last_seq)
goto next_set;
+ /*
+ * 'where' points to first jset in list which
+ * is elder then j.
+ */
if (j->seq > i->j.seq) {
where = &i->list;
goto add;
if (!i)
return -ENOMEM;
memcpy(&i->j, j, bytes);
+ /* Add to the location after 'where' points to */
list_add(&i->list, where);
ret = 1;
- ja->seq[bucket_index] = j->seq;
+ if (j->seq > ja->seq[bucket_index])
+ ja->seq[bucket_index] = j->seq;
next_set:
offset += blocks * ca->sb.block_size;
len -= blocks * ca->sb.block_size;
struct journal_replay,
list)->j.seq;
- return ret;
+ return 0;
#undef read_bucket
}
}
/* Journalling */
-#define journal_max_cmp(l, r) \
- (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
- fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
-#define journal_min_cmp(l, r) \
- (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
- fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
static void btree_flush_write(struct cache_set *c)
{
- /*
- * Try to find the btree node with that references the oldest journal
- * entry, best is our current candidate and is locked if non NULL:
- */
- struct btree *b;
- int i;
+ struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
+ unsigned int i, n;
+
+ if (c->journal.btree_flushing)
+ return;
+
+ spin_lock(&c->journal.flush_write_lock);
+ if (c->journal.btree_flushing) {
+ spin_unlock(&c->journal.flush_write_lock);
+ return;
+ }
+ c->journal.btree_flushing = true;
+ spin_unlock(&c->journal.flush_write_lock);
atomic_long_inc(&c->flush_write);
+ memset(btree_nodes, 0, sizeof(btree_nodes));
+ n = 0;
-retry:
- spin_lock(&c->journal.lock);
- if (heap_empty(&c->flush_btree)) {
- for_each_cached_btree(b, c, i)
- if (btree_current_write(b)->journal) {
- if (!heap_full(&c->flush_btree))
- heap_add(&c->flush_btree, b,
- journal_max_cmp);
- else if (journal_max_cmp(b,
- heap_peek(&c->flush_btree))) {
- c->flush_btree.data[0] = b;
- heap_sift(&c->flush_btree, 0,
- journal_max_cmp);
- }
- }
+ mutex_lock(&c->bucket_lock);
+ list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) {
+ if (btree_node_journal_flush(b))
+ pr_err("BUG: flush_write bit should not be set here!");
+
+ mutex_lock(&b->write_lock);
- for (i = c->flush_btree.used / 2 - 1; i >= 0; --i)
- heap_sift(&c->flush_btree, i, journal_min_cmp);
+ if (!btree_node_dirty(b)) {
+ mutex_unlock(&b->write_lock);
+ continue;
+ }
+
+ if (!btree_current_write(b)->journal) {
+ mutex_unlock(&b->write_lock);
+ continue;
+ }
+
+ set_btree_node_journal_flush(b);
+
+ mutex_unlock(&b->write_lock);
+
+ btree_nodes[n++] = b;
+ if (n == BTREE_FLUSH_NR)
+ break;
}
+ mutex_unlock(&c->bucket_lock);
- b = NULL;
- heap_pop(&c->flush_btree, b, journal_min_cmp);
- spin_unlock(&c->journal.lock);
+ for (i = 0; i < n; i++) {
+ b = btree_nodes[i];
+ if (!b) {
+ pr_err("BUG: btree_nodes[%d] is NULL", i);
+ continue;
+ }
+
+ /* safe to check without holding b->write_lock */
+ if (!btree_node_journal_flush(b)) {
+ pr_err("BUG: bnode %p: journal_flush bit cleaned", b);
+ continue;
+ }
- if (b) {
mutex_lock(&b->write_lock);
if (!btree_current_write(b)->journal) {
+ clear_bit(BTREE_NODE_journal_flush, &b->flags);
+ mutex_unlock(&b->write_lock);
+ pr_debug("bnode %p: written by others", b);
+ continue;
+ }
+
+ if (!btree_node_dirty(b)) {
+ clear_bit(BTREE_NODE_journal_flush, &b->flags);
mutex_unlock(&b->write_lock);
- /* We raced */
- atomic_long_inc(&c->retry_flush_write);
- goto retry;
+ pr_debug("bnode %p: dirty bit cleaned by others", b);
+ continue;
}
__bch_btree_node_write(b, NULL);
+ clear_bit(BTREE_NODE_journal_flush, &b->flags);
mutex_unlock(&b->write_lock);
}
+
+ spin_lock(&c->journal.flush_write_lock);
+ c->journal.btree_flushing = false;
+ spin_unlock(&c->journal.flush_write_lock);
}
#define last_seq(j) ((j)->seq - fifo_used(&(j)->pin) + 1)
k->ptr[n++] = MAKE_PTR(0,
bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
ca->sb.nr_this_dev);
+ atomic_long_inc(&c->reclaimed_journal_buckets);
}
if (n) {
struct journal_write *w;
atomic_t *ret;
+ /* No journaling if CACHE_SET_IO_DISABLE set already */
+ if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
+ return NULL;
+
if (!CACHE_SYNC(&c->sb))
return NULL;
free_pages((unsigned long) c->journal.w[1].data, JSET_BITS);
free_pages((unsigned long) c->journal.w[0].data, JSET_BITS);
free_fifo(&c->journal.pin);
- free_heap(&c->flush_btree);
}
int bch_journal_alloc(struct cache_set *c)
struct journal *j = &c->journal;
spin_lock_init(&j->lock);
+ spin_lock_init(&j->flush_write_lock);
INIT_DELAYED_WORK(&j->work, journal_write_work);
c->journal_delay_ms = 100;
j->w[0].c = c;
j->w[1].c = c;
- if (!(init_heap(&c->flush_btree, 128, GFP_KERNEL)) ||
- !(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
+ if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
!(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) ||
!(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)))
return -ENOMEM;
/* Embedded in struct cache_set */
struct journal {
spinlock_t lock;
+ spinlock_t flush_write_lock;
+ bool btree_flushing;
/* used when waiting because the journal was full */
struct closure_waitlist wait;
struct closure io;
struct bio_vec bv[8];
};
+#define BTREE_FLUSH_NR 8
+
#define journal_pin_cmp(c, l, r) \
(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))
static struct kobject *bcache_kobj;
struct mutex bch_register_lock;
+bool bcache_is_reboot;
LIST_HEAD(bch_cache_sets);
static LIST_HEAD(uncached_devices);
struct workqueue_struct *bcache_wq;
struct workqueue_struct *bch_journal_wq;
+
#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)
/* limitation of partitions number on single bcache device */
#define BCACHE_MINORS 128
static void write_bdev_super_endio(struct bio *bio)
{
struct cached_dev *dc = bio->bi_private;
- /* XXX: error checking */
+
+ if (bio->bi_status)
+ bch_count_backing_io_errors(dc, bio);
closure_put(&dc->sb_write);
}
{
unsigned int i;
struct cache *ca;
+ int ret;
for_each_cache(ca, d->c, i)
bd_link_disk_holder(ca->bdev, d->disk);
snprintf(d->name, BCACHEDEVNAME_SIZE,
"%s%u", name, d->id);
- WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") ||
- sysfs_create_link(&c->kobj, &d->kobj, d->name),
- "Couldn't create device <-> cache set symlinks");
+ ret = sysfs_create_link(&d->kobj, &c->kobj, "cache");
+ if (ret < 0)
+ pr_err("Couldn't create device -> cache set symlink");
+
+ ret = sysfs_create_link(&c->kobj, &d->kobj, d->name);
+ if (ret < 0)
+ pr_err("Couldn't create cache set -> device symlink");
clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags);
}
}
-void bch_cached_dev_run(struct cached_dev *dc)
+int bch_cached_dev_run(struct cached_dev *dc)
{
struct bcache_device *d = &dc->disk;
char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL);
NULL,
};
+ if (dc->io_disable) {
+ pr_err("I/O disabled on cached dev %s",
+ dc->backing_dev_name);
+ return -EIO;
+ }
+
if (atomic_xchg(&dc->running, 1)) {
kfree(env[1]);
kfree(env[2]);
kfree(buf);
- return;
+ pr_info("cached dev %s is running already",
+ dc->backing_dev_name);
+ return -EBUSY;
}
if (!d->c &&
kfree(buf);
if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
- sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache"))
- pr_debug("error creating sysfs link");
+ sysfs_create_link(&disk_to_dev(d->disk)->kobj,
+ &d->kobj, "bcache")) {
+ pr_err("Couldn't create bcache dev <-> disk sysfs symlinks");
+ return -ENOMEM;
+ }
dc->status_update_thread = kthread_run(cached_dev_status_update,
dc, "bcache_status_update");
"continue to run without monitoring backing "
"device status");
}
+
+ return 0;
}
/*
BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags));
BUG_ON(refcount_read(&dc->count));
- mutex_lock(&bch_register_lock);
if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
cancel_writeback_rate_update_dwork(dc);
bch_write_bdev_super(dc, &cl);
closure_sync(&cl);
+ mutex_lock(&bch_register_lock);
+
calc_cached_dev_sectors(dc->disk.c);
bcache_device_detach(&dc->disk);
list_move(&dc->list, &uncached_devices);
uint32_t rtime = cpu_to_le32((u32)ktime_get_real_seconds());
struct uuid_entry *u;
struct cached_dev *exist_dc, *t;
+ int ret = 0;
if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) ||
(!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)))
down_write(&dc->writeback_lock);
if (bch_cached_dev_writeback_start(dc)) {
up_write(&dc->writeback_lock);
+ pr_err("Couldn't start writeback facilities for %s",
+ dc->disk.disk->disk_name);
return -ENOMEM;
}
bch_sectors_dirty_init(&dc->disk);
- bch_cached_dev_run(dc);
+ ret = bch_cached_dev_run(dc);
+ if (ret && (ret != -EBUSY)) {
+ up_write(&dc->writeback_lock);
+ /*
+ * bch_register_lock is held, bcache_device_stop() is not
+ * able to be directly called. The kthread and kworker
+ * created previously in bch_cached_dev_writeback_start()
+ * have to be stopped manually here.
+ */
+ kthread_stop(dc->writeback_thread);
+ cancel_writeback_rate_update_dwork(dc);
+ pr_err("Couldn't run cached device %s",
+ dc->backing_dev_name);
+ return ret;
+ }
+
bcache_device_link(&dc->disk, c, "bdev");
atomic_inc(&c->attached_dev_nr);
{
struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl);
- mutex_lock(&bch_register_lock);
-
if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
cancel_writeback_rate_update_dwork(dc);
if (!IS_ERR_OR_NULL(dc->writeback_thread))
kthread_stop(dc->writeback_thread);
- if (dc->writeback_write_wq)
- destroy_workqueue(dc->writeback_write_wq);
if (!IS_ERR_OR_NULL(dc->status_update_thread))
kthread_stop(dc->status_update_thread);
+ mutex_lock(&bch_register_lock);
+
if (atomic_read(&dc->running))
bd_unlink_disk_holder(dc->bdev, dc->disk.disk);
bcache_device_free(&dc->disk);
{
const char *err = "cannot allocate memory";
struct cache_set *c;
+ int ret = -ENOMEM;
bdevname(bdev, dc->backing_dev_name);
memcpy(&dc->sb, sb, sizeof(struct cache_sb));
bch_cached_dev_attach(dc, c, NULL);
if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE ||
- BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
- bch_cached_dev_run(dc);
+ BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) {
+ err = "failed to run cached device";
+ ret = bch_cached_dev_run(dc);
+ if (ret)
+ goto err;
+ }
return 0;
err:
pr_notice("error %s: %s", dc->backing_dev_name, err);
bcache_device_stop(&dc->disk);
- return -EIO;
+ return ret;
}
/* Flash only volumes */
bool bch_cached_dev_error(struct cached_dev *dc)
{
- struct cache_set *c;
-
if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags))
return false;
pr_err("stop %s: too many IO errors on backing device %s\n",
dc->disk.disk->disk_name, dc->backing_dev_name);
- /*
- * If the cached device is still attached to a cache set,
- * even dc->io_disable is true and no more I/O requests
- * accepted, cache device internal I/O (writeback scan or
- * garbage collection) may still prevent bcache device from
- * being stopped. So here CACHE_SET_IO_DISABLE should be
- * set to c->flags too, to make the internal I/O to cache
- * device rejected and stopped immediately.
- * If c is NULL, that means the bcache device is not attached
- * to any cache set, then no CACHE_SET_IO_DISABLE bit to set.
- */
- c = dc->disk.c;
- if (c && test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags))
- pr_info("CACHE_SET_IO_DISABLE already set");
-
bcache_device_stop(&dc->disk);
return true;
}
kobject_put(&c->internal);
kobject_del(&c->kobj);
- if (c->gc_thread)
+ if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
if (!IS_ERR_OR_NULL(c->root))
list_add(&c->root->list, &c->btree_cache);
- /* Should skip this if we're unregistering because of an error */
- list_for_each_entry(b, &c->btree_cache, list) {
- mutex_lock(&b->write_lock);
- if (btree_node_dirty(b))
- __bch_btree_node_write(b, NULL);
- mutex_unlock(&b->write_lock);
- }
+ /*
+ * Avoid flushing cached nodes if cache set is retiring
+ * due to too many I/O errors detected.
+ */
+ if (!test_bit(CACHE_SET_IO_DISABLE, &c->flags))
+ list_for_each_entry(b, &c->btree_cache, list) {
+ mutex_lock(&b->write_lock);
+ if (btree_node_dirty(b))
+ __bch_btree_node_write(b, NULL);
+ mutex_unlock(&b->write_lock);
+ }
for_each_cache(ca, c, i)
if (ca->alloc_thread)
if (bch_btree_check(c))
goto err;
+ /*
+ * bch_btree_check() may occupy too much system memory which
+ * has negative effects to user space application (e.g. data
+ * base) performance. Shrink the mca cache memory proactively
+ * here to avoid competing memory with user space workloads..
+ */
+ if (!c->shrinker_disabled) {
+ struct shrink_control sc;
+
+ sc.gfp_mask = GFP_KERNEL;
+ sc.nr_to_scan = c->btree_cache_used * c->btree_pages;
+ /* first run to clear b->accessed tag */
+ c->shrink.scan_objects(&c->shrink, &sc);
+ /* second run to reap non-accessed nodes */
+ c->shrink.scan_objects(&c->shrink, &sc);
+ }
+
bch_journal_mark(c, &journal);
bch_initial_gc_finish(c);
pr_debug("btree_check() done");
}
closure_sync(&cl);
- /* XXX: test this, it's broken */
+
bch_cache_set_error(c, "%s", err);
return -EIO;
static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
const char *buffer, size_t size);
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+ struct kobj_attribute *attr,
+ const char *buffer, size_t size);
kobj_attribute_write(register, register_bcache);
kobj_attribute_write(register_quiet, register_bcache);
+kobj_attribute_write(pendings_cleanup, bch_pending_bdevs_cleanup);
static bool bch_is_open_backing(struct block_device *bdev)
{
if (!try_module_get(THIS_MODULE))
return -EBUSY;
+ /* For latest state of bcache_is_reboot */
+ smp_mb();
+ if (bcache_is_reboot)
+ return -EBUSY;
+
path = kstrndup(buffer, size, GFP_KERNEL);
if (!path)
goto err;
goto out;
}
+
+struct pdev {
+ struct list_head list;
+ struct cached_dev *dc;
+};
+
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+ struct kobj_attribute *attr,
+ const char *buffer,
+ size_t size)
+{
+ LIST_HEAD(pending_devs);
+ ssize_t ret = size;
+ struct cached_dev *dc, *tdc;
+ struct pdev *pdev, *tpdev;
+ struct cache_set *c, *tc;
+
+ mutex_lock(&bch_register_lock);
+ list_for_each_entry_safe(dc, tdc, &uncached_devices, list) {
+ pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL);
+ if (!pdev)
+ break;
+ pdev->dc = dc;
+ list_add(&pdev->list, &pending_devs);
+ }
+
+ list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+ list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
+ char *pdev_set_uuid = pdev->dc->sb.set_uuid;
+ char *set_uuid = c->sb.uuid;
+
+ if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
+ list_del(&pdev->list);
+ kfree(pdev);
+ break;
+ }
+ }
+ }
+ mutex_unlock(&bch_register_lock);
+
+ list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+ pr_info("delete pdev %p", pdev);
+ list_del(&pdev->list);
+ bcache_device_stop(&pdev->dc->disk);
+ kfree(pdev);
+ }
+
+ return ret;
+}
+
static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
{
+ if (bcache_is_reboot)
+ return NOTIFY_DONE;
+
if (code == SYS_DOWN ||
code == SYS_HALT ||
code == SYS_POWER_OFF) {
mutex_lock(&bch_register_lock);
+ if (bcache_is_reboot)
+ goto out;
+
+ /* New registration is rejected since now */
+ bcache_is_reboot = true;
+ /*
+ * Make registering caller (if there is) on other CPU
+ * core know bcache_is_reboot set to true earlier
+ */
+ smp_mb();
+
if (list_empty(&bch_cache_sets) &&
list_empty(&uncached_devices))
goto out;
+ mutex_unlock(&bch_register_lock);
+
pr_info("Stopping all devices:");
+ /*
+ * The reason bch_register_lock is not held to call
+ * bch_cache_set_stop() and bcache_device_stop() is to
+ * avoid potential deadlock during reboot, because cache
+ * set or bcache device stopping process will acqurie
+ * bch_register_lock too.
+ *
+ * We are safe here because bcache_is_reboot sets to
+ * true already, register_bcache() will reject new
+ * registration now. bcache_is_reboot also makes sure
+ * bcache_reboot() won't be re-entered on by other thread,
+ * so there is no race in following list iteration by
+ * list_for_each_entry_safe().
+ */
list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
bch_cache_set_stop(c);
list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
bcache_device_stop(&dc->disk);
- mutex_unlock(&bch_register_lock);
/*
* Give an early chance for other kthreads and
static const struct attribute *files[] = {
&ksysfs_register.attr,
&ksysfs_register_quiet.attr,
+ &ksysfs_pendings_cleanup.attr,
NULL
};
bch_debug_init();
closure_debug_init();
+ bcache_is_reboot = false;
+
return 0;
err:
bcache_exit();
#include <linux/sort.h>
#include <linux/sched/clock.h>
+extern bool bcache_is_reboot;
+
/* Default is 0 ("writethrough") */
static const char * const bch_cache_modes[] = {
"writethrough",
"writeback",
"writearound",
- "none",
- NULL
+ "none"
};
/* Default is 0 ("auto") */
static const char * const bch_stop_on_failure_modes[] = {
"auto",
- "always",
- NULL
+ "always"
};
static const char * const cache_replacement_policies[] = {
"lru",
"fifo",
- "random",
- NULL
+ "random"
};
static const char * const error_actions[] = {
"unregister",
- "panic",
- NULL
+ "panic"
};
write_attribute(attach);
read_attribute(state);
read_attribute(cache_read_races);
read_attribute(reclaim);
+read_attribute(reclaimed_journal_buckets);
read_attribute(flush_write);
-read_attribute(retry_flush_write);
read_attribute(writeback_keys_done);
read_attribute(writeback_keys_failed);
read_attribute(io_errors);
var_print(writeback_percent);
sysfs_hprint(writeback_rate,
wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
- sysfs_hprint(io_errors, atomic_read(&dc->io_errors));
+ sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
sysfs_printf(io_error_limit, "%i", dc->error_limit);
sysfs_printf(io_disable, "%i", dc->io_disable);
var_print(writeback_rate_update_seconds);
struct cache_set *c;
struct kobj_uevent_env *env;
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
#define d_strtoul(var) sysfs_strtoul(var, dc->var)
#define d_strtoul_nonzero(var) sysfs_strtoul_clamp(var, dc->var, 1, INT_MAX)
#define d_strtoi_h(var) sysfs_hatoi(var, dc->var)
bch_cache_accounting_clear(&dc->accounting);
if (attr == &sysfs_running &&
- strtoul_or_return(buf))
- bch_cached_dev_run(dc);
+ strtoul_or_return(buf)) {
+ v = bch_cached_dev_run(dc);
+ if (v)
+ return v;
+ }
if (attr == &sysfs_cache_mode) {
- v = __sysfs_match_string(bch_cache_modes, -1, buf);
+ v = sysfs_match_string(bch_cache_modes, buf);
if (v < 0)
return v;
}
if (attr == &sysfs_stop_when_cache_set_failed) {
- v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf);
+ v = sysfs_match_string(bch_stop_on_failure_modes, buf);
if (v < 0)
return v;
struct cached_dev *dc = container_of(kobj, struct cached_dev,
disk.kobj);
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
mutex_lock(&bch_register_lock);
size = __cached_dev_store(kobj, attr, buf, size);
&sysfs_writeback_rate_p_term_inverse,
&sysfs_writeback_rate_minimum,
&sysfs_writeback_rate_debug,
- &sysfs_errors,
+ &sysfs_io_errors,
&sysfs_io_error_limit,
&sysfs_io_disable,
&sysfs_dirty_data,
kobj);
struct uuid_entry *u = &d->c->uuids[d->id];
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
sysfs_strtoul(data_csum, d->data_csum);
if (attr == &sysfs_size) {
sysfs_print(reclaim,
atomic_long_read(&c->reclaim));
+ sysfs_print(reclaimed_journal_buckets,
+ atomic_long_read(&c->reclaimed_journal_buckets));
+
sysfs_print(flush_write,
atomic_long_read(&c->flush_write));
- sysfs_print(retry_flush_write,
- atomic_long_read(&c->retry_flush_write));
-
sysfs_print(writeback_keys_done,
atomic_long_read(&c->writeback_keys_done));
sysfs_print(writeback_keys_failed,
struct cache_set *c = container_of(kobj, struct cache_set, kobj);
ssize_t v;
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
if (attr == &sysfs_unregister)
bch_cache_set_unregister(c);
0, UINT_MAX);
if (attr == &sysfs_errors) {
- v = __sysfs_match_string(error_actions, -1, buf);
+ v = sysfs_match_string(error_actions, buf);
if (v < 0)
return v;
{
struct cache_set *c = container_of(kobj, struct cache_set, internal);
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
return bch_cache_set_store(&c->kobj, attr, buf, size);
}
&sysfs_bset_tree_stats,
&sysfs_cache_read_races,
&sysfs_reclaim,
+ &sysfs_reclaimed_journal_buckets,
&sysfs_flush_write,
- &sysfs_retry_flush_write,
&sysfs_writeback_keys_done,
&sysfs_writeback_keys_failed,
struct cache *ca = container_of(kobj, struct cache, kobj);
ssize_t v;
+ /* no user space access if system is rebooting */
+ if (bcache_is_reboot)
+ return -EBUSY;
+
if (attr == &sysfs_discard) {
bool v = strtoul_or_return(buf);
}
if (attr == &sysfs_cache_replacement_policy) {
- v = __sysfs_match_string(cache_replacement_policies, -1, buf);
+ v = sysfs_match_string(cache_replacement_policies, buf);
if (v < 0)
return v;
#define heap_full(h) ((h)->used == (h)->size)
-#define heap_empty(h) ((h)->used == 0)
-
#define DECLARE_FIFO(type, name) \
struct { \
size_t front, back, size, mask; \
static bool set_at_max_writeback_rate(struct cache_set *c,
struct cached_dev *dc)
{
+ /* Don't set max writeback rate if gc is running */
+ if (!c->gc_mark_valid)
+ return false;
/*
* Idle_counter is increased everytime when update_writeback_rate() is
* called. If all backing devices attached to the same cache set have
}
}
+ if (dc->writeback_write_wq) {
+ flush_workqueue(dc->writeback_write_wq);
+ destroy_workqueue(dc->writeback_write_wq);
+ }
cached_dev_put(dc);
wait_for_kthread_stop();
"bcache_writeback");
if (IS_ERR(dc->writeback_thread)) {
cached_dev_put(dc);
+ destroy_workqueue(dc->writeback_write_wq);
return PTR_ERR(dc->writeback_thread);
}
dc->writeback_running = true;
return;
md_bitmap_wait_behind_writes(mddev);
+ mempool_destroy(mddev->wb_info_pool);
+ mddev->wb_info_pool = NULL;
mutex_lock(&mddev->bitmap_info.mutex);
spin_lock(&mddev->lock);
sector_t start = 0;
sector_t sector = 0;
struct bitmap *bitmap = mddev->bitmap;
+ struct md_rdev *rdev;
if (!bitmap)
goto out;
+ rdev_for_each(rdev, mddev)
+ mddev_create_wb_pool(mddev, rdev, true);
+
if (mddev_is_clustered(mddev))
md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
backlog_store(struct mddev *mddev, const char *buf, size_t len)
{
unsigned long backlog;
+ unsigned long old_mwb = mddev->bitmap_info.max_write_behind;
int rv = kstrtoul(buf, 10, &backlog);
if (rv)
return rv;
if (backlog > COUNTER_MAX)
return -EINVAL;
mddev->bitmap_info.max_write_behind = backlog;
+ if (!backlog && mddev->wb_info_pool) {
+ /* wb_info_pool is not needed if backlog is zero */
+ mempool_destroy(mddev->wb_info_pool);
+ mddev->wb_info_pool = NULL;
+ } else if (backlog && !mddev->wb_info_pool) {
+ /* wb_info_pool is needed since backlog is not zero */
+ struct md_rdev *rdev;
+
+ rdev_for_each(rdev, mddev)
+ mddev_create_wb_pool(mddev, rdev, false);
+ }
+ if (old_mwb != backlog)
+ md_bitmap_update_sb(mddev->bitmap);
return len;
}
*/
+#include <linux/sched/mm.h>
#include <linux/sched/signal.h>
#include <linux/kthread.h>
#include <linux/blkdev.h>
mddev->sync_speed_max : sysctl_speed_limit_max;
}
+static int rdev_init_wb(struct md_rdev *rdev)
+{
+ if (rdev->bdev->bd_queue->nr_hw_queues == 1)
+ return 0;
+
+ spin_lock_init(&rdev->wb_list_lock);
+ INIT_LIST_HEAD(&rdev->wb_list);
+ init_waitqueue_head(&rdev->wb_io_wait);
+ set_bit(WBCollisionCheck, &rdev->flags);
+
+ return 1;
+}
+
+/*
+ * Create wb_info_pool if rdev is the first multi-queue device flaged
+ * with writemostly, also write-behind mode is enabled.
+ */
+void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev,
+ bool is_suspend)
+{
+ if (mddev->bitmap_info.max_write_behind == 0)
+ return;
+
+ if (!test_bit(WriteMostly, &rdev->flags) || !rdev_init_wb(rdev))
+ return;
+
+ if (mddev->wb_info_pool == NULL) {
+ unsigned int noio_flag;
+
+ if (!is_suspend)
+ mddev_suspend(mddev);
+ noio_flag = memalloc_noio_save();
+ mddev->wb_info_pool = mempool_create_kmalloc_pool(NR_WB_INFOS,
+ sizeof(struct wb_info));
+ memalloc_noio_restore(noio_flag);
+ if (!mddev->wb_info_pool)
+ pr_err("can't alloc memory pool for writemostly\n");
+ if (!is_suspend)
+ mddev_resume(mddev);
+ }
+}
+EXPORT_SYMBOL_GPL(mddev_create_wb_pool);
+
+/*
+ * destroy wb_info_pool if rdev is the last device flaged with WBCollisionCheck.
+ */
+static void mddev_destroy_wb_pool(struct mddev *mddev, struct md_rdev *rdev)
+{
+ if (!test_and_clear_bit(WBCollisionCheck, &rdev->flags))
+ return;
+
+ if (mddev->wb_info_pool) {
+ struct md_rdev *temp;
+ int num = 0;
+
+ /*
+ * Check if other rdevs need wb_info_pool.
+ */
+ rdev_for_each(temp, mddev)
+ if (temp != rdev &&
+ test_bit(WBCollisionCheck, &temp->flags))
+ num++;
+ if (!num) {
+ mddev_suspend(rdev->mddev);
+ mempool_destroy(mddev->wb_info_pool);
+ mddev->wb_info_pool = NULL;
+ mddev_resume(rdev->mddev);
+ }
+ }
+}
+
static struct ctl_table_header *raid_table_header;
static struct ctl_table raid_table[] = {
rdev->mddev = mddev;
pr_debug("md: bind<%s>\n", b);
+ if (mddev->raid_disks)
+ mddev_create_wb_pool(mddev, rdev, false);
+
if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
goto fail;
bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
list_del_rcu(&rdev->same_set);
pr_debug("md: unbind<%s>\n", bdevname(rdev->bdev,b));
+ mddev_destroy_wb_pool(rdev->mddev, rdev);
rdev->mddev = NULL;
sysfs_remove_link(&rdev->kobj, "block");
sysfs_put(rdev->sysfs_state);
}
} else if (cmd_match(buf, "writemostly")) {
set_bit(WriteMostly, &rdev->flags);
+ mddev_create_wb_pool(rdev->mddev, rdev, false);
err = 0;
} else if (cmd_match(buf, "-writemostly")) {
+ mddev_destroy_wb_pool(rdev->mddev, rdev);
clear_bit(WriteMostly, &rdev->flags);
err = 0;
} else if (cmd_match(buf, "blocked")) {
if (!entry->show)
return -EIO;
if (!rdev->mddev)
- return -EBUSY;
+ return -ENODEV;
return entry->show(rdev, page);
}
mddev->bitmap = bitmap;
}
- if (err) {
- mddev_detach(mddev);
- if (mddev->private)
- pers->free(mddev, mddev->private);
- mddev->private = NULL;
- module_put(pers->owner);
- md_bitmap_destroy(mddev);
- goto abort;
+ if (err)
+ goto bitmap_abort;
+
+ if (mddev->bitmap_info.max_write_behind > 0) {
+ bool creat_pool = false;
+
+ rdev_for_each(rdev, mddev) {
+ if (test_bit(WriteMostly, &rdev->flags) &&
+ rdev_init_wb(rdev))
+ creat_pool = true;
+ }
+ if (creat_pool && mddev->wb_info_pool == NULL) {
+ mddev->wb_info_pool =
+ mempool_create_kmalloc_pool(NR_WB_INFOS,
+ sizeof(struct wb_info));
+ if (!mddev->wb_info_pool) {
+ err = -ENOMEM;
+ goto bitmap_abort;
+ }
+ }
}
+
if (mddev->queue) {
bool nonrot = true;
spin_unlock(&mddev->lock);
rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0)
- if (sysfs_link_rdev(mddev, rdev))
- /* failure here is OK */;
+ sysfs_link_rdev(mddev, rdev); /* failure here is OK */
if (mddev->degraded && !mddev->ro)
/* This ensures that recovering status is reported immediately
sysfs_notify(&mddev->kobj, NULL, "degraded");
return 0;
+bitmap_abort:
+ mddev_detach(mddev);
+ if (mddev->private)
+ pers->free(mddev, mddev->private);
+ mddev->private = NULL;
+ module_put(pers->owner);
+ md_bitmap_destroy(mddev);
abort:
bioset_exit(&mddev->bio_set);
bioset_exit(&mddev->sync_set);
mddev->in_sync = 1;
md_update_sb(mddev, 1);
}
+ mempool_destroy(mddev->wb_info_pool);
+ mddev->wb_info_pool = NULL;
}
void md_stop_writes(struct mddev *mddev)
{
struct mddev *mddev = thread->mddev;
struct mddev *mddev2;
- unsigned int currspeed = 0,
- window;
+ unsigned int currspeed = 0, window;
sector_t max_sectors,j, io_sectors, recovery_done;
unsigned long mark[SYNC_MARKS];
unsigned long update_time;
* 0 == not engaged in resync at all
* 2 == checking that there is no conflict with another sync
* 1 == like 2, but have yielded to allow conflicting resync to
- * commense
+ * commence
* other == active in resync - this many blocks
*
* Before starting a resync we must have set curr_resync to
/*
* Tune reconstruction:
*/
- window = 32*(PAGE_SIZE/512);
+ window = 32 * (PAGE_SIZE / 512);
pr_debug("md: using %dk window, over a total of %lluk.\n",
window/2, (unsigned long long)max_sectors/2);
* perform resync with the new activated disk */
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
-
}
/* device faulty
* We just want to do the minimum to mark the disk
* for reporting to userspace and storing
* in superblock.
*/
+
+ /*
+ * The members for check collision of write behind IOs.
+ */
+ struct list_head wb_list;
+ spinlock_t wb_list_lock;
+ wait_queue_head_t wb_io_wait;
+
struct work_struct del_work; /* used for delayed sysfs removal */
struct kernfs_node *sysfs_state; /* handle for 'state'
* it didn't fail, so don't use FailFast
* any more for metadata
*/
+ WBCollisionCheck, /*
+ * multiqueue device should check if there
+ * is collision between write behind bios.
+ */
};
static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */
};
+#define NR_WB_INFOS 8
+/* record current range of write behind IOs */
+struct wb_info {
+ sector_t lo;
+ sector_t hi;
+ struct list_head list;
+};
+
struct mddev {
void *private;
struct md_personality *pers;
*/
struct work_struct flush_work;
struct work_struct event_work; /* used by dm to report failure event */
+ mempool_t *wb_info_pool;
void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
struct md_cluster_info *cluster_info;
unsigned int good_device_nr; /* good device num within cluster raid */
extern void md_reload_sb(struct mddev *mddev, int raid_disk);
extern void md_update_sb(struct mddev *mddev, int force);
extern void md_kick_rdev_from_array(struct md_rdev * rdev);
+extern void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev,
+ bool is_suspend);
struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev);
#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
+/*
+ * Number of guaranteed raid bios in case of extreme VM load:
+ */
+#define NR_RAID_BIOS 256
+
+/* when we get a read error on a read-only array, we redirect to another
+ * device without failing the first device, or trying to over-write to
+ * correct the read error. To keep track of bad blocks on a per-bio
+ * level, we store IO_BLOCKED in the appropriate 'bios' pointer
+ */
+#define IO_BLOCKED ((struct bio *)1)
+/* When we successfully write to a known bad-block, we need to remove the
+ * bad-block marking which must be done from process context. So we record
+ * the success by setting devs[n].bio to IO_MADE_GOOD
+ */
+#define IO_MADE_GOOD ((struct bio *)2)
+
+#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
+
+/* When there are this many requests queue to be written by
+ * the raid thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
+
/* for managing resync I/O pages */
struct resync_pages {
void *raid_bio;
struct page *pages[RESYNC_PAGES];
};
+static void rbio_pool_free(void *rbio, void *data)
+{
+ kfree(rbio);
+}
+
static inline int resync_alloc_pages(struct resync_pages *rp,
gfp_t gfp_flags)
{
(1L << MD_HAS_PPL) | \
(1L << MD_HAS_MULTIPLE_PPLS))
-/*
- * Number of guaranteed r1bios in case of extreme VM load:
- */
-#define NR_RAID1_BIOS 256
-
-/* when we get a read error on a read-only array, we redirect to another
- * device without failing the first device, or trying to over-write to
- * correct the read error. To keep track of bad blocks on a per-bio
- * level, we store IO_BLOCKED in the appropriate 'bios' pointer
- */
-#define IO_BLOCKED ((struct bio *)1)
-/* When we successfully write to a known bad-block, we need to remove the
- * bad-block marking which must be done from process context. So we record
- * the success by setting devs[n].bio to IO_MADE_GOOD
- */
-#define IO_MADE_GOOD ((struct bio *)2)
-
-#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
-
-/* When there are this many requests queue to be written by
- * the raid1 thread, we become 'congested' to provide back-pressure
- * for writeback.
- */
-static int max_queued_requests = 1024;
-
static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
#include "raid1-10.c"
+static int check_and_add_wb(struct md_rdev *rdev, sector_t lo, sector_t hi)
+{
+ struct wb_info *wi, *temp_wi;
+ unsigned long flags;
+ int ret = 0;
+ struct mddev *mddev = rdev->mddev;
+
+ wi = mempool_alloc(mddev->wb_info_pool, GFP_NOIO);
+
+ spin_lock_irqsave(&rdev->wb_list_lock, flags);
+ list_for_each_entry(temp_wi, &rdev->wb_list, list) {
+ /* collision happened */
+ if (hi > temp_wi->lo && lo < temp_wi->hi) {
+ ret = -EBUSY;
+ break;
+ }
+ }
+
+ if (!ret) {
+ wi->lo = lo;
+ wi->hi = hi;
+ list_add(&wi->list, &rdev->wb_list);
+ } else
+ mempool_free(wi, mddev->wb_info_pool);
+ spin_unlock_irqrestore(&rdev->wb_list_lock, flags);
+
+ return ret;
+}
+
+static void remove_wb(struct md_rdev *rdev, sector_t lo, sector_t hi)
+{
+ struct wb_info *wi;
+ unsigned long flags;
+ int found = 0;
+ struct mddev *mddev = rdev->mddev;
+
+ spin_lock_irqsave(&rdev->wb_list_lock, flags);
+ list_for_each_entry(wi, &rdev->wb_list, list)
+ if (hi == wi->hi && lo == wi->lo) {
+ list_del(&wi->list);
+ mempool_free(wi, mddev->wb_info_pool);
+ found = 1;
+ break;
+ }
+
+ if (!found)
+ WARN(1, "The write behind IO is not recorded\n");
+ spin_unlock_irqrestore(&rdev->wb_list_lock, flags);
+ wake_up(&rdev->wb_io_wait);
+}
+
/*
* for resync bio, r1bio pointer can be retrieved from the per-bio
* 'struct resync_pages'.
return kzalloc(size, gfp_flags);
}
-static void r1bio_pool_free(void *r1_bio, void *data)
-{
- kfree(r1_bio);
-}
-
#define RESYNC_DEPTH 32
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
#define RESYNC_WINDOW (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
kfree(rps);
out_free_r1bio:
- r1bio_pool_free(r1_bio, data);
+ rbio_pool_free(r1_bio, data);
return NULL;
}
/* resync pages array stored in the 1st bio's .bi_private */
kfree(rp);
- r1bio_pool_free(r1bio, data);
+ rbio_pool_free(r1bio, data);
}
static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio)
}
if (behind) {
+ if (test_bit(WBCollisionCheck, &rdev->flags)) {
+ sector_t lo = r1_bio->sector;
+ sector_t hi = r1_bio->sector + r1_bio->sectors;
+
+ remove_wb(rdev, lo, hi);
+ }
if (test_bit(WriteMostly, &rdev->flags))
atomic_dec(&r1_bio->behind_remaining);
if (!r1_bio->bios[i])
continue;
-
if (first_clone) {
/* do behind I/O ?
* Not if there are too many, or cannot
mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
if (r1_bio->behind_master_bio) {
- if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
+ struct md_rdev *rdev = conf->mirrors[i].rdev;
+
+ if (test_bit(WBCollisionCheck, &rdev->flags)) {
+ sector_t lo = r1_bio->sector;
+ sector_t hi = r1_bio->sector + r1_bio->sectors;
+
+ wait_event(rdev->wb_io_wait,
+ check_and_add_wb(rdev, lo, hi) == 0);
+ }
+ if (test_bit(WriteMostly, &rdev->flags))
atomic_inc(&r1_bio->behind_remaining);
}
first = last = rdev->saved_raid_disk;
for (mirror = first; mirror <= last; mirror++) {
- p = conf->mirrors+mirror;
+ p = conf->mirrors + mirror;
if (!p->rdev) {
-
if (mddev->gendisk)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
if (read_targets == 1)
bio->bi_opf &= ~MD_FAILFAST;
generic_make_request(bio);
-
}
return nr_sectors;
}
if (!conf->poolinfo)
goto abort;
conf->poolinfo->raid_disks = mddev->raid_disks * 2;
- err = mempool_init(&conf->r1bio_pool, NR_RAID1_BIOS, r1bio_pool_alloc,
- r1bio_pool_free, conf->poolinfo);
+ err = mempool_init(&conf->r1bio_pool, NR_RAID_BIOS, r1bio_pool_alloc,
+ rbio_pool_free, conf->poolinfo);
if (err)
goto abort;
}
mddev->degraded = 0;
- for (i=0; i < conf->raid_disks; i++)
+ for (i = 0; i < conf->raid_disks; i++)
if (conf->mirrors[i].rdev == NULL ||
!test_bit(In_sync, &conf->mirrors[i].rdev->flags) ||
test_bit(Faulty, &conf->mirrors[i].rdev->flags))
mddev->queue);
}
- ret = md_integrity_register(mddev);
+ ret = md_integrity_register(mddev);
if (ret) {
md_unregister_thread(&mddev->thread);
raid1_free(mddev, conf);
newpoolinfo->mddev = mddev;
newpoolinfo->raid_disks = raid_disks * 2;
- ret = mempool_init(&newpool, NR_RAID1_BIOS, r1bio_pool_alloc,
- r1bio_pool_free, newpoolinfo);
+ ret = mempool_init(&newpool, NR_RAID_BIOS, r1bio_pool_alloc,
+ rbio_pool_free, newpoolinfo);
if (ret) {
kfree(newpoolinfo);
return ret;
* [B A] [D C] [B A] [E C D]
*/
-/*
- * Number of guaranteed r10bios in case of extreme VM load:
- */
-#define NR_RAID10_BIOS 256
-
-/* when we get a read error on a read-only array, we redirect to another
- * device without failing the first device, or trying to over-write to
- * correct the read error. To keep track of bad blocks on a per-bio
- * level, we store IO_BLOCKED in the appropriate 'bios' pointer
- */
-#define IO_BLOCKED ((struct bio *)1)
-/* When we successfully write to a known bad-block, we need to remove the
- * bad-block marking which must be done from process context. So we record
- * the success by setting devs[n].bio to IO_MADE_GOOD
- */
-#define IO_MADE_GOOD ((struct bio *)2)
-
-#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
-
-/* When there are this many requests queued to be written by
- * the raid10 thread, we become 'congested' to provide back-pressure
- * for writeback.
- */
-static int max_queued_requests = 1024;
-
static void allow_barrier(struct r10conf *conf);
static void lower_barrier(struct r10conf *conf);
static int _enough(struct r10conf *conf, int previous, int ignore);
return kzalloc(size, gfp_flags);
}
-static void r10bio_pool_free(void *r10_bio, void *data)
-{
- kfree(r10_bio);
-}
-
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
/* amount of memory to reserve for resync requests */
#define RESYNC_WINDOW (1024*1024)
}
kfree(rps);
out_free_r10bio:
- r10bio_pool_free(r10_bio, conf);
+ rbio_pool_free(r10_bio, conf);
return NULL;
}
/* resync pages array stored in the 1st bio's .bi_private */
kfree(rp);
- r10bio_pool_free(r10bio, conf);
+ rbio_pool_free(r10bio, conf);
}
static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
int sectors = r10_bio->sectors;
int best_good_sectors;
sector_t new_distance, best_dist;
- struct md_rdev *best_rdev, *rdev = NULL;
+ struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL;
int do_balance;
- int best_slot;
+ int best_dist_slot, best_pending_slot;
+ bool has_nonrot_disk = false;
+ unsigned int min_pending;
struct geom *geo = &conf->geo;
raid10_find_phys(conf, r10_bio);
rcu_read_lock();
- best_slot = -1;
- best_rdev = NULL;
+ best_dist_slot = -1;
+ min_pending = UINT_MAX;
+ best_dist_rdev = NULL;
+ best_pending_rdev = NULL;
best_dist = MaxSector;
best_good_sectors = 0;
do_balance = 1;
sector_t first_bad;
int bad_sectors;
sector_t dev_sector;
+ unsigned int pending;
+ bool nonrot;
if (r10_bio->devs[slot].bio == IO_BLOCKED)
continue;
first_bad - dev_sector;
if (good_sectors > best_good_sectors) {
best_good_sectors = good_sectors;
- best_slot = slot;
- best_rdev = rdev;
+ best_dist_slot = slot;
+ best_dist_rdev = rdev;
}
if (!do_balance)
/* Must read from here */
if (!do_balance)
break;
- if (best_slot >= 0)
+ nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
+ has_nonrot_disk |= nonrot;
+ pending = atomic_read(&rdev->nr_pending);
+ if (min_pending > pending && nonrot) {
+ min_pending = pending;
+ best_pending_slot = slot;
+ best_pending_rdev = rdev;
+ }
+
+ if (best_dist_slot >= 0)
/* At least 2 disks to choose from so failfast is OK */
set_bit(R10BIO_FailFast, &r10_bio->state);
/* This optimisation is debatable, and completely destroys
* sequential read speed for 'far copies' arrays. So only
* keep it for 'near' arrays, and review those later.
*/
- if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
+ if (geo->near_copies > 1 && !pending)
new_distance = 0;
/* for far > 1 always use the lowest address */
else
new_distance = abs(r10_bio->devs[slot].addr -
conf->mirrors[disk].head_position);
+
if (new_distance < best_dist) {
best_dist = new_distance;
- best_slot = slot;
- best_rdev = rdev;
+ best_dist_slot = slot;
+ best_dist_rdev = rdev;
}
}
if (slot >= conf->copies) {
- slot = best_slot;
- rdev = best_rdev;
+ if (has_nonrot_disk) {
+ slot = best_pending_slot;
+ rdev = best_pending_rdev;
+ } else {
+ slot = best_dist_slot;
+ rdev = best_dist_rdev;
+ }
}
if (slot >= 0) {
conf->geo = geo;
conf->copies = copies;
- err = mempool_init(&conf->r10bio_pool, NR_RAID10_BIOS, r10bio_pool_alloc,
- r10bio_pool_free, conf);
+ err = mempool_init(&conf->r10bio_pool, NR_RAID_BIOS, r10bio_pool_alloc,
+ rbio_pool_free, conf);
if (err)
goto out;
int idx = 0;
struct page **pages;
- r10b = kmalloc(sizeof(*r10b) +
- sizeof(struct r10dev) * conf->copies, GFP_NOIO);
+ r10b = kmalloc(struct_size(r10b, devs, conf->copies), GFP_NOIO);
if (!r10b) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
return -ENOMEM;
rcu_read_unlock();
raid_bio->bi_next = (void*)rdev;
bio_set_dev(align_bi, rdev->bdev);
- bio_clear_flag(align_bi, BIO_SEG_VALID);
if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
bio_sectors(align_bi),
static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
{
struct r5conf *conf = mddev->private;
- int err = -EEXIST;
+ int ret, err = -EEXIST;
int disk;
struct disk_info *p;
int first = 0;
* The array is in readonly mode if journal is missing, so no
* write requests running. We should be safe
*/
- log_init(conf, rdev, false);
+ ret = log_init(conf, rdev, false);
+ if (ret)
+ return ret;
+
+ ret = r5l_start(conf->log);
+ if (ret)
+ return ret;
+
return 0;
}
if (mddev->recovery_disabled == conf->recovery_disabled)
return id;
}
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
- void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen, u32 *result)
{
struct nvme_command c;
union nvme_result res;
int ret;
memset(&c, 0, sizeof(c));
- c.features.opcode = nvme_admin_set_features;
+ c.features.opcode = op;
c.features.fid = cpu_to_le32(fid);
c.features.dword11 = cpu_to_le32(dword11);
return ret;
}
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
u32 q_count = (*count - 1) | ((*count - 1) << 16);
device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
nvme_mpath_add_disk(ns, id);
- nvme_fault_inject_init(ns);
+ nvme_fault_inject_init(&ns->fault_inject, ns->disk->disk_name);
kfree(id);
return 0;
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;
- nvme_fault_inject_fini(ns);
+ nvme_fault_inject_fini(&ns->fault_inject);
+
+ mutex_lock(&ns->ctrl->subsys->lock);
+ list_del_rcu(&ns->siblings);
+ mutex_unlock(&ns->ctrl->subsys->lock);
+ synchronize_rcu(); /* guarantee not available in head->list */
+ nvme_mpath_clear_current_path(ns);
+ synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */
+
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
blk_integrity_unregister(ns->disk);
}
- mutex_lock(&ns->ctrl->subsys->lock);
- list_del_rcu(&ns->siblings);
- nvme_mpath_clear_current_path(ns);
- mutex_unlock(&ns->ctrl->subsys->lock);
-
down_write(&ns->ctrl->namespaces_rwsem);
list_del_init(&ns->list);
up_write(&ns->ctrl->namespaces_rwsem);
- synchronize_srcu(&ns->head->srcu);
nvme_mpath_check_last_path(ns);
nvme_put_ns(ns);
}
void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
{
+ nvme_fault_inject_fini(&ctrl->fault_inject);
dev_pm_qos_hide_latency_tolerance(ctrl->device);
cdev_device_del(&ctrl->cdev, ctrl->device);
}
dev_pm_qos_update_user_latency_tolerance(ctrl->device,
min(default_ps_max_latency_us, (unsigned long)S32_MAX));
+ nvme_fault_inject_init(&ctrl->fault_inject, dev_name(ctrl->device));
+
return 0;
out_free_name:
kfree_const(ctrl->device->kobj.name);
switch (ctrl->state) {
case NVME_CTRL_NEW:
case NVME_CTRL_CONNECTING:
- if (req->cmd->common.opcode == nvme_fabrics_command &&
+ if (nvme_is_fabrics(req->cmd) &&
req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
return true;
break;
static char *fail_request;
module_param(fail_request, charp, 0000);
-void nvme_fault_inject_init(struct nvme_ns *ns)
+void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj,
+ const char *dev_name)
{
struct dentry *dir, *parent;
- char *name = ns->disk->disk_name;
- struct nvme_fault_inject *fault_inj = &ns->fault_inject;
struct fault_attr *attr = &fault_inj->attr;
/* set default fault injection attribute */
setup_fault_attr(&fail_default_attr, fail_request);
/* create debugfs directory and attribute */
- parent = debugfs_create_dir(name, NULL);
+ parent = debugfs_create_dir(dev_name, NULL);
if (!parent) {
- pr_warn("%s: failed to create debugfs directory\n", name);
+ pr_warn("%s: failed to create debugfs directory\n", dev_name);
return;
}
*attr = fail_default_attr;
dir = fault_create_debugfs_attr("fault_inject", parent, attr);
if (IS_ERR(dir)) {
- pr_warn("%s: failed to create debugfs attr\n", name);
+ pr_warn("%s: failed to create debugfs attr\n", dev_name);
debugfs_remove_recursive(parent);
return;
}
- ns->fault_inject.parent = parent;
+ fault_inj->parent = parent;
/* create debugfs for status code and dont_retry */
fault_inj->status = NVME_SC_INVALID_OPCODE;
debugfs_create_bool("dont_retry", 0600, dir, &fault_inj->dont_retry);
}
-void nvme_fault_inject_fini(struct nvme_ns *ns)
+void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inject)
{
/* remove debugfs directories */
- debugfs_remove_recursive(ns->fault_inject.parent);
+ debugfs_remove_recursive(fault_inject->parent);
}
void nvme_should_fail(struct request *req)
{
struct gendisk *disk = req->rq_disk;
- struct nvme_ns *ns = NULL;
+ struct nvme_fault_inject *fault_inject = NULL;
u16 status;
- /*
- * make sure this request is coming from a valid namespace
- */
- if (!disk)
- return;
+ if (disk) {
+ struct nvme_ns *ns = disk->private_data;
+
+ if (ns)
+ fault_inject = &ns->fault_inject;
+ else
+ WARN_ONCE(1, "No namespace found for request\n");
+ } else {
+ fault_inject = &nvme_req(req)->ctrl->fault_inject;
+ }
- ns = disk->private_data;
- if (ns && should_fail(&ns->fault_inject.attr, 1)) {
+ if (fault_inject && should_fail(&fault_inject->attr, 1)) {
/* inject status code and DNR bit */
- status = ns->fault_inject.status;
- if (ns->fault_inject.dont_retry)
+ status = fault_inject->status;
+ if (fault_inject->dont_retry)
status |= NVME_SC_DNR;
nvme_req(req)->status = status;
}
if (nvme_fc_ctlr_active_on_rport(ctrl))
return -ENOTUNIQ;
+ dev_info(ctrl->ctrl.device,
+ "NVME-FC{%d}: create association : host wwpn 0x%016llx "
+ " rport wwpn 0x%016llx: NQN \"%s\"\n",
+ ctrl->cnum, ctrl->lport->localport.port_name,
+ ctrl->rport->remoteport.port_name, ctrl->ctrl.opts->subsysnqn);
+
/*
* Create the admin queue
*/
rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
if (rqd->bio)
- blk_init_request_from_bio(rq, rqd->bio);
+ blk_rq_append_bio(rq, &rqd->bio);
else
rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
NVME_CTRL_DEAD,
};
+struct nvme_fault_inject {
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+ struct fault_attr attr;
+ struct dentry *parent;
+ bool dont_retry; /* DNR, do not retry */
+ u16 status; /* status code */
+#endif
+};
+
struct nvme_ctrl {
bool comp_seen;
enum nvme_ctrl_state state;
struct page *discard_page;
unsigned long discard_page_busy;
+
+ struct nvme_fault_inject fault_inject;
};
enum nvme_iopolicy {
#endif
};
-#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-struct nvme_fault_inject {
- struct fault_attr attr;
- struct dentry *parent;
- bool dont_retry; /* DNR, do not retry */
- u16 status; /* status code */
-};
-#endif
-
struct nvme_ns {
struct list_head list;
#define NVME_NS_ANA_PENDING 2
u16 noiob;
-#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
struct nvme_fault_inject fault_inject;
-#endif
};
};
#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-void nvme_fault_inject_init(struct nvme_ns *ns);
-void nvme_fault_inject_fini(struct nvme_ns *ns);
+void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj,
+ const char *dev_name);
+void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inject);
void nvme_should_fail(struct request *req);
#else
-static inline void nvme_fault_inject_init(struct nvme_ns *ns) {}
-static inline void nvme_fault_inject_fini(struct nvme_ns *ns) {}
+static inline void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj,
+ const char *dev_name)
+{
+}
+static inline void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inj)
+{
+}
static inline void nvme_should_fail(struct request *req) {}
#endif
union nvme_result *result, void *buffer, unsigned bufflen,
unsigned timeout, int qid, int at_head,
blk_mq_req_flags_t flags, bool poll);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
#include <linux/mutex.h>
#include <linux/once.h>
#include <linux/pci.h>
+#include <linux/suspend.h>
#include <linux/t10-pi.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
-static int queue_count_set(const char *val, const struct kernel_param *kp);
-static const struct kernel_param_ops queue_count_ops = {
- .set = queue_count_set,
- .get = param_get_int,
-};
-
static int write_queues;
-module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644);
+module_param(write_queues, int, 0644);
MODULE_PARM_DESC(write_queues,
"Number of queues to use for writes. If not set, reads and writes "
"will share a queue set.");
-static int poll_queues = 0;
-module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
+static int poll_queues;
+module_param(poll_queues, int, 0644);
MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
struct nvme_dev;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
+ u32 last_ps;
mempool_t *iod_mempool;
return param_set_int(val, kp);
}
-static int queue_count_set(const char *val, const struct kernel_param *kp)
-{
- int n, ret;
-
- ret = kstrtoint(val, 10, &n);
- if (ret)
- return ret;
- if (n > num_possible_cpus())
- n = num_possible_cpus();
-
- return param_set_int(val, kp);
-}
-
static inline unsigned int sq_idx(unsigned int qid, u32 stride)
{
return qid * 2 * stride;
.priv = dev,
};
unsigned int irq_queues, this_p_queues;
+ unsigned int nr_cpus = num_possible_cpus();
/*
* Poll queues don't need interrupts, but we need at least one IO
this_p_queues = nr_io_queues - 1;
irq_queues = 1;
} else {
- irq_queues = nr_io_queues - this_p_queues + 1;
+ if (nr_cpus < nr_io_queues - this_p_queues)
+ irq_queues = nr_cpus + 1;
+ else
+ irq_queues = nr_io_queues - this_p_queues + 1;
}
dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
kfree(dev);
}
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
{
- dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
-
nvme_get_ctrl(&dev->ctrl);
nvme_dev_disable(dev, false);
nvme_kill_queues(&dev->ctrl);
struct nvme_dev *dev =
container_of(work, struct nvme_dev, ctrl.reset_work);
bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
- int result = -ENODEV;
+ int result;
enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
- if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
+ if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
+ result = -ENODEV;
goto out;
+ }
/*
* If we're called to reset a live controller first shut it down before
if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) {
dev_warn(dev->ctrl.device,
"failed to mark controller CONNECTING\n");
+ result = -EBUSY;
goto out;
}
if (!nvme_change_ctrl_state(&dev->ctrl, new_state)) {
dev_warn(dev->ctrl.device,
"failed to mark controller state %d\n", new_state);
+ result = -ENODEV;
goto out;
}
out_unlock:
mutex_unlock(&dev->shutdown_lock);
out:
- nvme_remove_dead_ctrl(dev, result);
+ if (result)
+ dev_warn(dev->ctrl.device,
+ "Removing after probe failure status: %d\n", result);
+ nvme_remove_dead_ctrl(dev);
}
static void nvme_remove_dead_ctrl_work(struct work_struct *work)
}
#ifdef CONFIG_PM_SLEEP
+static int nvme_get_power_state(struct nvme_ctrl *ctrl, u32 *ps)
+{
+ return nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0, ps);
+}
+
+static int nvme_set_power_state(struct nvme_ctrl *ctrl, u32 ps)
+{
+ return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
+}
+
+static int nvme_resume(struct device *dev)
+{
+ struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
+ struct nvme_ctrl *ctrl = &ndev->ctrl;
+
+ if (pm_resume_via_firmware() || !ctrl->npss ||
+ nvme_set_power_state(ctrl, ndev->last_ps) != 0)
+ nvme_reset_ctrl(ctrl);
+ return 0;
+}
+
static int nvme_suspend(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
+ struct nvme_ctrl *ctrl = &ndev->ctrl;
+ int ret = -EBUSY;
+
+ /*
+ * The platform does not remove power for a kernel managed suspend so
+ * use host managed nvme power settings for lowest idle power if
+ * possible. This should have quicker resume latency than a full device
+ * shutdown. But if the firmware is involved after the suspend or the
+ * device does not support any non-default power states, shut down the
+ * device fully.
+ */
+ if (pm_suspend_via_firmware() || !ctrl->npss) {
+ nvme_dev_disable(ndev, true);
+ return 0;
+ }
+
+ nvme_start_freeze(ctrl);
+ nvme_wait_freeze(ctrl);
+ nvme_sync_queues(ctrl);
+
+ if (ctrl->state != NVME_CTRL_LIVE &&
+ ctrl->state != NVME_CTRL_ADMIN_ONLY)
+ goto unfreeze;
+
+ ndev->last_ps = 0;
+ ret = nvme_get_power_state(ctrl, &ndev->last_ps);
+ if (ret < 0)
+ goto unfreeze;
+
+ ret = nvme_set_power_state(ctrl, ctrl->npss);
+ if (ret < 0)
+ goto unfreeze;
+
+ if (ret) {
+ /*
+ * Clearing npss forces a controller reset on resume. The
+ * correct value will be resdicovered then.
+ */
+ nvme_dev_disable(ndev, true);
+ ctrl->npss = 0;
+ ret = 0;
+ goto unfreeze;
+ }
+ /*
+ * A saved state prevents pci pm from generically controlling the
+ * device's power. If we're using protocol specific settings, we don't
+ * want pci interfering.
+ */
+ pci_save_state(pdev);
+unfreeze:
+ nvme_unfreeze(ctrl);
+ return ret;
+}
+
+static int nvme_simple_suspend(struct device *dev)
+{
+ struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
nvme_dev_disable(ndev, true);
return 0;
}
-static int nvme_resume(struct device *dev)
+static int nvme_simple_resume(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
nvme_reset_ctrl(&ndev->ctrl);
return 0;
}
-#endif
-static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
+const struct dev_pm_ops nvme_dev_pm_ops = {
+ .suspend = nvme_suspend,
+ .resume = nvme_resume,
+ .freeze = nvme_simple_suspend,
+ .thaw = nvme_simple_resume,
+ .poweroff = nvme_simple_suspend,
+ .restore = nvme_simple_resume,
+};
+#endif /* CONFIG_PM_SLEEP */
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
.probe = nvme_probe,
.remove = nvme_remove,
.shutdown = nvme_shutdown,
+#ifdef CONFIG_PM_SLEEP
.driver = {
.pm = &nvme_dev_pm_ops,
},
+#endif
.sriov_configure = pci_sriov_configure_simple,
.err_handler = &nvme_err_handler,
};
}
}
+static const char *nvme_trace_fabrics_property_set(struct trace_seq *p, u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 attrib = spc[0];
+ u32 ofst = get_unaligned_le32(spc + 4);
+ u64 value = get_unaligned_le64(spc + 8);
+
+ trace_seq_printf(p, "attrib=%u, ofst=0x%x, value=0x%llx",
+ attrib, ofst, value);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvme_trace_fabrics_connect(struct trace_seq *p, u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u16 recfmt = get_unaligned_le16(spc);
+ u16 qid = get_unaligned_le16(spc + 2);
+ u16 sqsize = get_unaligned_le16(spc + 4);
+ u8 cattr = spc[6];
+ u32 kato = get_unaligned_le32(spc + 8);
+
+ trace_seq_printf(p, "recfmt=%u, qid=%u, sqsize=%u, cattr=%u, kato=%u",
+ recfmt, qid, sqsize, cattr, kato);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvme_trace_fabrics_property_get(struct trace_seq *p, u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 attrib = spc[0];
+ u32 ofst = get_unaligned_le32(spc + 4);
+
+ trace_seq_printf(p, "attrib=%u, ofst=0x%x", attrib, ofst);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvme_trace_fabrics_common(struct trace_seq *p, u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ trace_seq_printf(p, "spcecific=%*ph", 24, spc);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+const char *nvme_trace_parse_fabrics_cmd(struct trace_seq *p,
+ u8 fctype, u8 *spc)
+{
+ switch (fctype) {
+ case nvme_fabrics_type_property_set:
+ return nvme_trace_fabrics_property_set(p, spc);
+ case nvme_fabrics_type_connect:
+ return nvme_trace_fabrics_connect(p, spc);
+ case nvme_fabrics_type_property_get:
+ return nvme_trace_fabrics_property_get(p, spc);
+ default:
+ return nvme_trace_fabrics_common(p, spc);
+ }
+}
+
const char *nvme_trace_disk_name(struct trace_seq *p, char *name)
{
const char *ret = trace_seq_buffer_ptr(p);
return ret;
}
-EXPORT_SYMBOL_GPL(nvme_trace_disk_name);
EXPORT_TRACEPOINT_SYMBOL_GPL(nvme_sq);
#include "nvme.h"
-#define nvme_admin_opcode_name(opcode) { opcode, #opcode }
-#define show_admin_opcode_name(val) \
- __print_symbolic(val, \
- nvme_admin_opcode_name(nvme_admin_delete_sq), \
- nvme_admin_opcode_name(nvme_admin_create_sq), \
- nvme_admin_opcode_name(nvme_admin_get_log_page), \
- nvme_admin_opcode_name(nvme_admin_delete_cq), \
- nvme_admin_opcode_name(nvme_admin_create_cq), \
- nvme_admin_opcode_name(nvme_admin_identify), \
- nvme_admin_opcode_name(nvme_admin_abort_cmd), \
- nvme_admin_opcode_name(nvme_admin_set_features), \
- nvme_admin_opcode_name(nvme_admin_get_features), \
- nvme_admin_opcode_name(nvme_admin_async_event), \
- nvme_admin_opcode_name(nvme_admin_ns_mgmt), \
- nvme_admin_opcode_name(nvme_admin_activate_fw), \
- nvme_admin_opcode_name(nvme_admin_download_fw), \
- nvme_admin_opcode_name(nvme_admin_ns_attach), \
- nvme_admin_opcode_name(nvme_admin_keep_alive), \
- nvme_admin_opcode_name(nvme_admin_directive_send), \
- nvme_admin_opcode_name(nvme_admin_directive_recv), \
- nvme_admin_opcode_name(nvme_admin_dbbuf), \
- nvme_admin_opcode_name(nvme_admin_format_nvm), \
- nvme_admin_opcode_name(nvme_admin_security_send), \
- nvme_admin_opcode_name(nvme_admin_security_recv), \
- nvme_admin_opcode_name(nvme_admin_sanitize_nvm))
-
-#define nvme_opcode_name(opcode) { opcode, #opcode }
-#define show_nvm_opcode_name(val) \
- __print_symbolic(val, \
- nvme_opcode_name(nvme_cmd_flush), \
- nvme_opcode_name(nvme_cmd_write), \
- nvme_opcode_name(nvme_cmd_read), \
- nvme_opcode_name(nvme_cmd_write_uncor), \
- nvme_opcode_name(nvme_cmd_compare), \
- nvme_opcode_name(nvme_cmd_write_zeroes), \
- nvme_opcode_name(nvme_cmd_dsm), \
- nvme_opcode_name(nvme_cmd_resv_register), \
- nvme_opcode_name(nvme_cmd_resv_report), \
- nvme_opcode_name(nvme_cmd_resv_acquire), \
- nvme_opcode_name(nvme_cmd_resv_release))
-
-#define show_opcode_name(qid, opcode) \
- (qid ? show_nvm_opcode_name(opcode) : show_admin_opcode_name(opcode))
-
const char *nvme_trace_parse_admin_cmd(struct trace_seq *p, u8 opcode,
u8 *cdw10);
const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p, u8 opcode,
u8 *cdw10);
+const char *nvme_trace_parse_fabrics_cmd(struct trace_seq *p, u8 fctype,
+ u8 *spc);
-#define parse_nvme_cmd(qid, opcode, cdw10) \
- (qid ? \
- nvme_trace_parse_nvm_cmd(p, opcode, cdw10) : \
- nvme_trace_parse_admin_cmd(p, opcode, cdw10))
+#define parse_nvme_cmd(qid, opcode, fctype, cdw10) \
+ ((opcode) == nvme_fabrics_command ? \
+ nvme_trace_parse_fabrics_cmd(p, fctype, cdw10) : \
+ ((qid) ? \
+ nvme_trace_parse_nvm_cmd(p, opcode, cdw10) : \
+ nvme_trace_parse_admin_cmd(p, opcode, cdw10)))
const char *nvme_trace_disk_name(struct trace_seq *p, char *name);
#define __print_disk_name(name) \
__field(int, qid)
__field(u8, opcode)
__field(u8, flags)
+ __field(u8, fctype)
__field(u16, cid)
__field(u32, nsid)
__field(u64, metadata)
__entry->cid = cmd->common.command_id;
__entry->nsid = le32_to_cpu(cmd->common.nsid);
__entry->metadata = le64_to_cpu(cmd->common.metadata);
+ __entry->fctype = cmd->fabrics.fctype;
__assign_disk_name(__entry->disk, req->rq_disk);
memcpy(__entry->cdw10, &cmd->common.cdw10,
sizeof(__entry->cdw10));
__entry->ctrl_id, __print_disk_name(__entry->disk),
__entry->qid, __entry->cid, __entry->nsid,
__entry->flags, __entry->metadata,
- show_opcode_name(__entry->qid, __entry->opcode),
- parse_nvme_cmd(__entry->qid, __entry->opcode, __entry->cdw10))
+ show_opcode_name(__entry->qid, __entry->opcode,
+ __entry->fctype),
+ parse_nvme_cmd(__entry->qid, __entry->opcode,
+ __entry->fctype, __entry->cdw10))
);
TRACE_EVENT(nvme_complete_rq,
__entry->status = nvme_req(req)->status;
__assign_disk_name(__entry->disk, req->rq_disk);
),
- TP_printk("nvme%d: %sqid=%d, cmdid=%u, res=%llu, retries=%u, flags=0x%x, status=%u",
+ TP_printk("nvme%d: %sqid=%d, cmdid=%u, res=%#llx, retries=%u, flags=0x%x, status=%#x",
__entry->ctrl_id, __print_disk_name(__entry->disk),
__entry->qid, __entry->cid, __entry->result,
__entry->retries, __entry->flags, __entry->status)
# SPDX-License-Identifier: GPL-2.0
+ccflags-y += -I$(src)
+
obj-$(CONFIG_NVME_TARGET) += nvmet.o
obj-$(CONFIG_NVME_TARGET_LOOP) += nvme-loop.o
obj-$(CONFIG_NVME_TARGET_RDMA) += nvmet-rdma.o
nvmet-fc-y += fc.o
nvme-fcloop-y += fcloop.o
nvmet-tcp-y += tcp.o
+nvmet-$(CONFIG_TRACING) += trace.o
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
#include "nvmet.h"
struct workqueue_struct *buffered_io_wq;
port->inline_data_size = 0;
port->enabled = true;
+ port->tr_ops = ops;
return 0;
}
lockdep_assert_held(&nvmet_config_sem);
port->enabled = false;
+ port->tr_ops = NULL;
ops = nvmet_transports[port->disc_addr.trtype];
ops->remove_port(port);
if (unlikely(status))
nvmet_set_error(req, status);
+
+ trace_nvmet_req_complete(req);
+
if (req->ns)
nvmet_put_namespace(req->ns);
req->ops->queue_response(req);
req->error_loc = NVMET_NO_ERROR_LOC;
req->error_slba = 0;
+ trace_nvmet_req_init(req, req->cmd);
+
/* no support for fused commands yet */
if (unlikely(flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND))) {
req->error_loc = offsetof(struct nvme_common_command, flags);
status = nvmet_parse_connect_cmd(req);
else if (likely(req->sq->qid != 0))
status = nvmet_parse_io_cmd(req);
- else if (req->cmd->common.opcode == nvme_fabrics_command)
+ else if (nvme_is_fabrics(req->cmd))
status = nvmet_parse_fabrics_cmd(req);
else if (req->sq->ctrl->subsys->type == NVME_NQN_DISC)
status = nvmet_parse_discovery_cmd(req);
__nvmet_disc_changed(port, ctrl);
}
mutex_unlock(&nvmet_disc_subsys->lock);
+
+ /* If transport can signal change, notify transport */
+ if (port->tr_ops && port->tr_ops->discovery_chg)
+ port->tr_ops->discovery_chg(port);
}
static void __nvmet_subsys_disc_changed(struct nvmet_port *port,
{
struct nvme_command *cmd = req->cmd;
- if (cmd->common.opcode != nvme_fabrics_command) {
+ if (!nvme_is_fabrics(cmd)) {
pr_err("invalid command 0x%x on unconnected queue.\n",
cmd->fabrics.opcode);
req->error_loc = offsetof(struct nvme_common_command, opcode);
*/
rspcnt = atomic_inc_return(&fod->queue->zrspcnt);
if (!(rspcnt % fod->queue->ersp_ratio) ||
- sqe->opcode == nvme_fabrics_command ||
+ nvme_is_fabrics((struct nvme_command *) sqe) ||
xfr_length != fod->req.transfer_len ||
(le16_to_cpu(cqe->status) & 0xFFFE) || cqewd[0] || cqewd[1] ||
(sqe->flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND)) ||
kfree(pe);
}
+static void
+nvmet_fc_discovery_chg(struct nvmet_port *port)
+{
+ struct nvmet_fc_port_entry *pe = port->priv;
+ struct nvmet_fc_tgtport *tgtport = pe->tgtport;
+
+ if (tgtport && tgtport->ops->discovery_event)
+ tgtport->ops->discovery_event(&tgtport->fc_target_port);
+}
+
static const struct nvmet_fabrics_ops nvmet_fc_tgt_fcp_ops = {
.owner = THIS_MODULE,
.type = NVMF_TRTYPE_FC,
.remove_port = nvmet_fc_remove_port,
.queue_response = nvmet_fc_fcp_nvme_cmd_done,
.delete_ctrl = nvmet_fc_delete_ctrl,
+ .discovery_chg = nvmet_fc_discovery_chg,
};
static int __init nvmet_fc_init_module(void)
int status;
};
+struct fcloop_rscn {
+ struct fcloop_tport *tport;
+ struct work_struct work;
+};
+
enum {
INI_IO_START = 0,
INI_IO_ACTIVE = 1,
return 0;
}
+/*
+ * Simulate reception of RSCN and converting it to a initiator transport
+ * call to rescan a remote port.
+ */
+static void
+fcloop_tgt_rscn_work(struct work_struct *work)
+{
+ struct fcloop_rscn *tgt_rscn =
+ container_of(work, struct fcloop_rscn, work);
+ struct fcloop_tport *tport = tgt_rscn->tport;
+
+ if (tport->remoteport)
+ nvme_fc_rescan_remoteport(tport->remoteport);
+ kfree(tgt_rscn);
+}
+
+static void
+fcloop_tgt_discovery_evt(struct nvmet_fc_target_port *tgtport)
+{
+ struct fcloop_rscn *tgt_rscn;
+
+ tgt_rscn = kzalloc(sizeof(*tgt_rscn), GFP_KERNEL);
+ if (!tgt_rscn)
+ return;
+
+ tgt_rscn->tport = tgtport->private;
+ INIT_WORK(&tgt_rscn->work, fcloop_tgt_rscn_work);
+
+ schedule_work(&tgt_rscn->work);
+}
+
static void
fcloop_tfcp_req_free(struct kref *ref)
{
.fcp_op = fcloop_fcp_op,
.fcp_abort = fcloop_tgt_fcp_abort,
.fcp_req_release = fcloop_fcp_req_release,
+ .discovery_event = fcloop_tgt_discovery_evt,
.max_hw_queues = FCLOOP_HW_QUEUES,
.max_sgl_segments = FCLOOP_SGL_SEGS,
.max_dif_sgl_segments = FCLOOP_SGL_SEGS,
void *priv;
bool enabled;
int inline_data_size;
+ const struct nvmet_fabrics_ops *tr_ops;
};
static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
void (*disc_traddr)(struct nvmet_req *req,
struct nvmet_port *port, char *traddr);
u16 (*install_queue)(struct nvmet_sq *nvme_sq);
+ void (*discovery_chg)(struct nvmet_port *port);
};
#define NVMET_MAX_INLINE_BIOVEC 8
--- /dev/null
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVM Express target device driver tracepoints
+ * Copyright (c) 2018 Johannes Thumshirn, SUSE Linux GmbH
+ */
+
+#include <asm/unaligned.h>
+#include "trace.h"
+
+static const char *nvmet_trace_admin_identify(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 cns = cdw10[0];
+ u16 ctrlid = get_unaligned_le16(cdw10 + 2);
+
+ trace_seq_printf(p, "cns=%u, ctrlid=%u", cns, ctrlid);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+static const char *nvmet_trace_admin_get_features(struct trace_seq *p,
+ u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 fid = cdw10[0];
+ u8 sel = cdw10[1] & 0x7;
+ u32 cdw11 = get_unaligned_le32(cdw10 + 4);
+
+ trace_seq_printf(p, "fid=0x%x sel=0x%x cdw11=0x%x", fid, sel, cdw11);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+static const char *nvmet_trace_read_write(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u64 slba = get_unaligned_le64(cdw10);
+ u16 length = get_unaligned_le16(cdw10 + 8);
+ u16 control = get_unaligned_le16(cdw10 + 10);
+ u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+ u32 reftag = get_unaligned_le32(cdw10 + 16);
+
+ trace_seq_printf(p,
+ "slba=%llu, len=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+ slba, length, control, dsmgmt, reftag);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+static const char *nvmet_trace_dsm(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ trace_seq_printf(p, "nr=%u, attributes=%u",
+ get_unaligned_le32(cdw10),
+ get_unaligned_le32(cdw10 + 4));
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+static const char *nvmet_trace_common(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ trace_seq_printf(p, "cdw10=%*ph", 24, cdw10);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+const char *nvmet_trace_parse_admin_cmd(struct trace_seq *p,
+ u8 opcode, u8 *cdw10)
+{
+ switch (opcode) {
+ case nvme_admin_identify:
+ return nvmet_trace_admin_identify(p, cdw10);
+ case nvme_admin_get_features:
+ return nvmet_trace_admin_get_features(p, cdw10);
+ default:
+ return nvmet_trace_common(p, cdw10);
+ }
+}
+
+const char *nvmet_trace_parse_nvm_cmd(struct trace_seq *p,
+ u8 opcode, u8 *cdw10)
+{
+ switch (opcode) {
+ case nvme_cmd_read:
+ case nvme_cmd_write:
+ case nvme_cmd_write_zeroes:
+ return nvmet_trace_read_write(p, cdw10);
+ case nvme_cmd_dsm:
+ return nvmet_trace_dsm(p, cdw10);
+ default:
+ return nvmet_trace_common(p, cdw10);
+ }
+}
+
+static const char *nvmet_trace_fabrics_property_set(struct trace_seq *p,
+ u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 attrib = spc[0];
+ u32 ofst = get_unaligned_le32(spc + 4);
+ u64 value = get_unaligned_le64(spc + 8);
+
+ trace_seq_printf(p, "attrib=%u, ofst=0x%x, value=0x%llx",
+ attrib, ofst, value);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvmet_trace_fabrics_connect(struct trace_seq *p,
+ u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u16 recfmt = get_unaligned_le16(spc);
+ u16 qid = get_unaligned_le16(spc + 2);
+ u16 sqsize = get_unaligned_le16(spc + 4);
+ u8 cattr = spc[6];
+ u32 kato = get_unaligned_le32(spc + 8);
+
+ trace_seq_printf(p, "recfmt=%u, qid=%u, sqsize=%u, cattr=%u, kato=%u",
+ recfmt, qid, sqsize, cattr, kato);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvmet_trace_fabrics_property_get(struct trace_seq *p,
+ u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u8 attrib = spc[0];
+ u32 ofst = get_unaligned_le32(spc + 4);
+
+ trace_seq_printf(p, "attrib=%u, ofst=0x%x", attrib, ofst);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+static const char *nvmet_trace_fabrics_common(struct trace_seq *p, u8 *spc)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ trace_seq_printf(p, "spcecific=%*ph", 24, spc);
+ trace_seq_putc(p, 0);
+ return ret;
+}
+
+const char *nvmet_trace_parse_fabrics_cmd(struct trace_seq *p,
+ u8 fctype, u8 *spc)
+{
+ switch (fctype) {
+ case nvme_fabrics_type_property_set:
+ return nvmet_trace_fabrics_property_set(p, spc);
+ case nvme_fabrics_type_connect:
+ return nvmet_trace_fabrics_connect(p, spc);
+ case nvme_fabrics_type_property_get:
+ return nvmet_trace_fabrics_property_get(p, spc);
+ default:
+ return nvmet_trace_fabrics_common(p, spc);
+ }
+}
+
+const char *nvmet_trace_disk_name(struct trace_seq *p, char *name)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ if (*name)
+ trace_seq_printf(p, "disk=%s, ", name);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
+const char *nvmet_trace_ctrl_name(struct trace_seq *p, struct nvmet_ctrl *ctrl)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+
+ /*
+ * XXX: We don't know the controller instance before executing the
+ * connect command itself because the connect command for the admin
+ * queue will not provide the cntlid which will be allocated in this
+ * command. In case of io queues, the controller instance will be
+ * mapped by the extra data of the connect command.
+ * If we can know the extra data of the connect command in this stage,
+ * we can update this print statement later.
+ */
+ if (ctrl)
+ trace_seq_printf(p, "%d", ctrl->cntlid);
+ else
+ trace_seq_printf(p, "_");
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
--- /dev/null
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * NVM Express target device driver tracepoints
+ * Copyright (c) 2018 Johannes Thumshirn, SUSE Linux GmbH
+ *
+ * This is entirely based on drivers/nvme/host/trace.h
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nvmet
+
+#if !defined(_TRACE_NVMET_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NVMET_H
+
+#include <linux/nvme.h>
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "nvmet.h"
+
+const char *nvmet_trace_parse_admin_cmd(struct trace_seq *p, u8 opcode,
+ u8 *cdw10);
+const char *nvmet_trace_parse_nvm_cmd(struct trace_seq *p, u8 opcode,
+ u8 *cdw10);
+const char *nvmet_trace_parse_fabrics_cmd(struct trace_seq *p, u8 fctype,
+ u8 *spc);
+
+#define parse_nvme_cmd(qid, opcode, fctype, cdw10) \
+ ((opcode) == nvme_fabrics_command ? \
+ nvmet_trace_parse_fabrics_cmd(p, fctype, cdw10) : \
+ (qid ? \
+ nvmet_trace_parse_nvm_cmd(p, opcode, cdw10) : \
+ nvmet_trace_parse_admin_cmd(p, opcode, cdw10)))
+
+const char *nvmet_trace_ctrl_name(struct trace_seq *p, struct nvmet_ctrl *ctrl);
+#define __print_ctrl_name(ctrl) \
+ nvmet_trace_ctrl_name(p, ctrl)
+
+const char *nvmet_trace_disk_name(struct trace_seq *p, char *name);
+#define __print_disk_name(name) \
+ nvmet_trace_disk_name(p, name)
+
+#ifndef TRACE_HEADER_MULTI_READ
+static inline struct nvmet_ctrl *nvmet_req_to_ctrl(struct nvmet_req *req)
+{
+ return req->sq->ctrl;
+}
+
+static inline void __assign_disk_name(char *name, struct nvmet_req *req,
+ bool init)
+{
+ struct nvmet_ctrl *ctrl = nvmet_req_to_ctrl(req);
+ struct nvmet_ns *ns;
+
+ if ((init && req->sq->qid) || (!init && req->cq->qid)) {
+ ns = nvmet_find_namespace(ctrl, req->cmd->rw.nsid);
+ strncpy(name, ns->device_path, DISK_NAME_LEN);
+ return;
+ }
+
+ memset(name, 0, DISK_NAME_LEN);
+}
+#endif
+
+TRACE_EVENT(nvmet_req_init,
+ TP_PROTO(struct nvmet_req *req, struct nvme_command *cmd),
+ TP_ARGS(req, cmd),
+ TP_STRUCT__entry(
+ __field(struct nvme_command *, cmd)
+ __field(struct nvmet_ctrl *, ctrl)
+ __array(char, disk, DISK_NAME_LEN)
+ __field(int, qid)
+ __field(u16, cid)
+ __field(u8, opcode)
+ __field(u8, fctype)
+ __field(u8, flags)
+ __field(u32, nsid)
+ __field(u64, metadata)
+ __array(u8, cdw10, 24)
+ ),
+ TP_fast_assign(
+ __entry->cmd = cmd;
+ __entry->ctrl = nvmet_req_to_ctrl(req);
+ __assign_disk_name(__entry->disk, req, true);
+ __entry->qid = req->sq->qid;
+ __entry->cid = cmd->common.command_id;
+ __entry->opcode = cmd->common.opcode;
+ __entry->fctype = cmd->fabrics.fctype;
+ __entry->flags = cmd->common.flags;
+ __entry->nsid = le32_to_cpu(cmd->common.nsid);
+ __entry->metadata = le64_to_cpu(cmd->common.metadata);
+ memcpy(__entry->cdw10, &cmd->common.cdw10,
+ sizeof(__entry->cdw10));
+ ),
+ TP_printk("nvmet%s: %sqid=%d, cmdid=%u, nsid=%u, flags=%#x, "
+ "meta=%#llx, cmd=(%s, %s)",
+ __print_ctrl_name(__entry->ctrl),
+ __print_disk_name(__entry->disk),
+ __entry->qid, __entry->cid, __entry->nsid,
+ __entry->flags, __entry->metadata,
+ show_opcode_name(__entry->qid, __entry->opcode,
+ __entry->fctype),
+ parse_nvme_cmd(__entry->qid, __entry->opcode,
+ __entry->fctype, __entry->cdw10))
+);
+
+TRACE_EVENT(nvmet_req_complete,
+ TP_PROTO(struct nvmet_req *req),
+ TP_ARGS(req),
+ TP_STRUCT__entry(
+ __field(struct nvmet_ctrl *, ctrl)
+ __array(char, disk, DISK_NAME_LEN)
+ __field(int, qid)
+ __field(int, cid)
+ __field(u64, result)
+ __field(u16, status)
+ ),
+ TP_fast_assign(
+ __entry->ctrl = nvmet_req_to_ctrl(req);
+ __entry->qid = req->cq->qid;
+ __entry->cid = req->cqe->command_id;
+ __entry->result = le64_to_cpu(req->cqe->result.u64);
+ __entry->status = le16_to_cpu(req->cqe->status) >> 1;
+ __assign_disk_name(__entry->disk, req, false);
+ ),
+ TP_printk("nvmet%s: %sqid=%d, cmdid=%u, res=%#llx, status=%#x",
+ __print_ctrl_name(__entry->ctrl),
+ __print_disk_name(__entry->disk),
+ __entry->qid, __entry->cid, __entry->result, __entry->status)
+
+);
+
+#endif /* _TRACE_NVMET_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
uint32_t elsXmitADISC;
uint32_t elsXmitLOGO;
uint32_t elsXmitSCR;
+ uint32_t elsXmitRSCN;
uint32_t elsXmitRNID;
uint32_t elsXmitFARP;
uint32_t elsXmitFARPR;
uint32_t cfg_use_msi;
uint32_t cfg_auto_imax;
uint32_t cfg_fcp_imax;
+ uint32_t cfg_force_rscn;
uint32_t cfg_cq_poll_threshold;
uint32_t cfg_cq_max_proc_limit;
uint32_t cfg_fcp_cpu_map;
lpfc_request_firmware_upgrade_store);
/**
+ * lpfc_force_rscn_store
+ *
+ * @dev: class device that is converted into a Scsi_host.
+ * @attr: device attribute, not used.
+ * @buf: unused string
+ * @count: unused variable.
+ *
+ * Description:
+ * Force the switch to send a RSCN to all other NPorts in our zone
+ * If we are direct connect pt2pt, build the RSCN command ourself
+ * and send to the other NPort. Not supported for private loop.
+ *
+ * Returns:
+ * 0 - on success
+ * -EIO - if command is not sent
+ **/
+static ssize_t
+lpfc_force_rscn_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct Scsi_Host *shost = class_to_shost(dev);
+ struct lpfc_vport *vport = (struct lpfc_vport *)shost->hostdata;
+ int i;
+
+ i = lpfc_issue_els_rscn(vport, 0);
+ if (i)
+ return -EIO;
+ return strlen(buf);
+}
+
+/*
+ * lpfc_force_rscn: Force an RSCN to be sent to all remote NPorts
+ * connected to the HBA.
+ *
+ * Value range is any ascii value
+ */
+static int lpfc_force_rscn;
+module_param(lpfc_force_rscn, int, 0644);
+MODULE_PARM_DESC(lpfc_force_rscn,
+ "Force an RSCN to be sent to all remote NPorts");
+lpfc_param_show(force_rscn)
+
+/**
+ * lpfc_force_rscn_init - Force an RSCN to be sent to all remote NPorts
+ * @phba: lpfc_hba pointer.
+ * @val: unused value.
+ *
+ * Returns:
+ * zero if val saved.
+ **/
+static int
+lpfc_force_rscn_init(struct lpfc_hba *phba, int val)
+{
+ return 0;
+}
+static DEVICE_ATTR_RW(lpfc_force_rscn);
+
+/**
* lpfc_fcp_imax_store
*
* @dev: class device that is converted into a Scsi_host.
&dev_attr_lpfc_nvme_oas,
&dev_attr_lpfc_nvme_embed_cmd,
&dev_attr_lpfc_fcp_imax,
+ &dev_attr_lpfc_force_rscn,
&dev_attr_lpfc_cq_poll_threshold,
&dev_attr_lpfc_cq_max_proc_limit,
&dev_attr_lpfc_fcp_cpu_map,
lpfc_nvme_oas_init(phba, lpfc_nvme_oas);
lpfc_nvme_embed_cmd_init(phba, lpfc_nvme_embed_cmd);
lpfc_fcp_imax_init(phba, lpfc_fcp_imax);
+ lpfc_force_rscn_init(phba, lpfc_force_rscn);
lpfc_cq_poll_threshold_init(phba, lpfc_cq_poll_threshold);
lpfc_cq_max_proc_limit_init(phba, lpfc_cq_max_proc_limit);
lpfc_fcp_cpu_map_init(phba, lpfc_fcp_cpu_map);
int lpfc_issue_els_logo(struct lpfc_vport *, struct lpfc_nodelist *, uint8_t);
int lpfc_issue_els_npiv_logo(struct lpfc_vport *, struct lpfc_nodelist *);
int lpfc_issue_els_scr(struct lpfc_vport *, uint32_t, uint8_t);
+int lpfc_issue_els_rscn(struct lpfc_vport *vport, uint8_t retry);
int lpfc_issue_fabric_reglogin(struct lpfc_vport *);
int lpfc_els_free_iocb(struct lpfc_hba *, struct lpfc_iocbq *);
int lpfc_ct_free_iocb(struct lpfc_hba *, struct lpfc_iocbq *);
struct lpfc_nodelist *lpfc_findnode_did(struct lpfc_vport *, uint32_t);
struct lpfc_nodelist *lpfc_findnode_wwpn(struct lpfc_vport *,
struct lpfc_name *);
+struct lpfc_nodelist *lpfc_findnode_mapped(struct lpfc_vport *vport);
int lpfc_sli_issue_mbox_wait(struct lpfc_hba *, LPFC_MBOXQ_t *, uint32_t);
int lpfc_check_fwlog_support(struct lpfc_hba *phba);
/* NVME interfaces. */
+void lpfc_nvme_rescan_port(struct lpfc_vport *vport,
+ struct lpfc_nodelist *ndlp);
void lpfc_nvme_unregister_port(struct lpfc_vport *vport,
struct lpfc_nodelist *ndlp);
int lpfc_nvme_register_port(struct lpfc_vport *vport,
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>
#include <scsi/scsi_transport_fc.h>
+#include <uapi/scsi/fc/fc_fs.h>
+#include <uapi/scsi/fc/fc_els.h>
#include "lpfc_hw4.h"
#include "lpfc_hw.h"
}
/**
+ * lpfc_issue_els_rscn - Issue an RSCN to the Fabric Controller (Fabric)
+ * or the other nport (pt2pt).
+ * @vport: pointer to a host virtual N_Port data structure.
+ * @retry: number of retries to the command IOCB.
+ *
+ * This routine issues a RSCN to the Fabric Controller (DID 0xFFFFFD)
+ * when connected to a fabric, or to the remote port when connected
+ * in point-to-point mode. When sent to the Fabric Controller, it will
+ * replay the RSCN to registered recipients.
+ *
+ * Note that, in lpfc_prep_els_iocb() routine, the reference count of ndlp
+ * will be incremented by 1 for holding the ndlp and the reference to ndlp
+ * will be stored into the context1 field of the IOCB for the completion
+ * callback function to the RSCN ELS command.
+ *
+ * Return code
+ * 0 - Successfully issued RSCN command
+ * 1 - Failed to issue RSCN command
+ **/
+int
+lpfc_issue_els_rscn(struct lpfc_vport *vport, uint8_t retry)
+{
+ struct lpfc_hba *phba = vport->phba;
+ struct lpfc_iocbq *elsiocb;
+ struct lpfc_nodelist *ndlp;
+ struct {
+ struct fc_els_rscn rscn;
+ struct fc_els_rscn_page portid;
+ } *event;
+ uint32_t nportid;
+ uint16_t cmdsize = sizeof(*event);
+
+ /* Not supported for private loop */
+ if (phba->fc_topology == LPFC_TOPOLOGY_LOOP &&
+ !(vport->fc_flag & FC_PUBLIC_LOOP))
+ return 1;
+
+ if (vport->fc_flag & FC_PT2PT) {
+ /* find any mapped nport - that would be the other nport */
+ ndlp = lpfc_findnode_mapped(vport);
+ if (!ndlp)
+ return 1;
+ } else {
+ nportid = FC_FID_FCTRL;
+ /* find the fabric controller node */
+ ndlp = lpfc_findnode_did(vport, nportid);
+ if (!ndlp) {
+ /* if one didn't exist, make one */
+ ndlp = lpfc_nlp_init(vport, nportid);
+ if (!ndlp)
+ return 1;
+ lpfc_enqueue_node(vport, ndlp);
+ } else if (!NLP_CHK_NODE_ACT(ndlp)) {
+ ndlp = lpfc_enable_node(vport, ndlp,
+ NLP_STE_UNUSED_NODE);
+ if (!ndlp)
+ return 1;
+ }
+ }
+
+ elsiocb = lpfc_prep_els_iocb(vport, 1, cmdsize, retry, ndlp,
+ ndlp->nlp_DID, ELS_CMD_RSCN_XMT);
+
+ if (!elsiocb) {
+ /* This will trigger the release of the node just
+ * allocated
+ */
+ lpfc_nlp_put(ndlp);
+ return 1;
+ }
+
+ event = ((struct lpfc_dmabuf *)elsiocb->context2)->virt;
+
+ event->rscn.rscn_cmd = ELS_RSCN;
+ event->rscn.rscn_page_len = sizeof(struct fc_els_rscn_page);
+ event->rscn.rscn_plen = cpu_to_be16(cmdsize);
+
+ nportid = vport->fc_myDID;
+ /* appears that page flags must be 0 for fabric to broadcast RSCN */
+ event->portid.rscn_page_flags = 0;
+ event->portid.rscn_fid[0] = (nportid & 0x00FF0000) >> 16;
+ event->portid.rscn_fid[1] = (nportid & 0x0000FF00) >> 8;
+ event->portid.rscn_fid[2] = nportid & 0x000000FF;
+
+ lpfc_debugfs_disc_trc(vport, LPFC_DISC_TRC_ELS_CMD,
+ "Issue RSCN: did:x%x",
+ ndlp->nlp_DID, 0, 0);
+
+ phba->fc_stat.elsXmitRSCN++;
+ elsiocb->iocb_cmpl = lpfc_cmpl_els_cmd;
+ if (lpfc_sli_issue_iocb(phba, LPFC_ELS_RING, elsiocb, 0) ==
+ IOCB_ERROR) {
+ /* The additional lpfc_nlp_put will cause the following
+ * lpfc_els_free_iocb routine to trigger the rlease of
+ * the node.
+ */
+ lpfc_nlp_put(ndlp);
+ lpfc_els_free_iocb(phba, elsiocb);
+ return 1;
+ }
+ /* This will cause the callback-function lpfc_cmpl_els_cmd to
+ * trigger the release of node.
+ */
+ if (!(vport->fc_flag & FC_PT2PT))
+ lpfc_nlp_put(ndlp);
+
+ return 0;
+}
+
+/**
* lpfc_issue_els_farpr - Issue a farp to an node on a vport
* @vport: pointer to a host virtual N_Port data structure.
* @nportid: N_Port identifier to the remote node.
continue;
}
+ if (ndlp->nlp_fc4_type & NLP_FC4_NVME)
+ lpfc_nvme_rescan_port(vport, ndlp);
lpfc_disc_state_machine(vport, ndlp, NULL,
NLP_EVT_DEVICE_RECOVERY);
fc_host_post_event(shost, fc_get_event_number(),
FCH_EVT_RSCN, lp[i]);
+ /* Check if RSCN is coming from a direct-connected remote NPort */
+ if (vport->fc_flag & FC_PT2PT) {
+ /* If so, just ACC it, no other action needed for now */
+ lpfc_printf_vlog(vport, KERN_INFO, LOG_ELS,
+ "2024 pt2pt RSCN %08x Data: x%x x%x\n",
+ *lp, vport->fc_flag, payload_len);
+ lpfc_els_rsp_acc(vport, ELS_CMD_ACC, cmdiocb, ndlp, NULL);
+
+ if (ndlp->nlp_fc4_type & NLP_FC4_NVME)
+ lpfc_nvme_rescan_port(vport, ndlp);
+ return 0;
+ }
+
/* If we are about to begin discovery, just ACC the RSCN.
* Discovery processing will satisfy it.
*/
}
struct lpfc_nodelist *
+lpfc_findnode_mapped(struct lpfc_vport *vport)
+{
+ struct Scsi_Host *shost = lpfc_shost_from_vport(vport);
+ struct lpfc_nodelist *ndlp;
+ uint32_t data1;
+ unsigned long iflags;
+
+ spin_lock_irqsave(shost->host_lock, iflags);
+
+ list_for_each_entry(ndlp, &vport->fc_nodes, nlp_listp) {
+ if (ndlp->nlp_state == NLP_STE_UNMAPPED_NODE ||
+ ndlp->nlp_state == NLP_STE_MAPPED_NODE) {
+ data1 = (((uint32_t)ndlp->nlp_state << 24) |
+ ((uint32_t)ndlp->nlp_xri << 16) |
+ ((uint32_t)ndlp->nlp_type << 8) |
+ ((uint32_t)ndlp->nlp_rpi & 0xff));
+ spin_unlock_irqrestore(shost->host_lock, iflags);
+ lpfc_printf_vlog(vport, KERN_INFO, LOG_NODE,
+ "2025 FIND node DID "
+ "Data: x%p x%x x%x x%x %p\n",
+ ndlp, ndlp->nlp_DID,
+ ndlp->nlp_flag, data1,
+ ndlp->active_rrqs_xri_bitmap);
+ return ndlp;
+ }
+ }
+ spin_unlock_irqrestore(shost->host_lock, iflags);
+
+ /* FIND node did <did> NOT FOUND */
+ lpfc_printf_vlog(vport, KERN_INFO, LOG_NODE,
+ "2026 FIND mapped did NOT FOUND.\n");
+ return NULL;
+}
+
+struct lpfc_nodelist *
lpfc_setup_disc_node(struct lpfc_vport *vport, uint32_t did)
{
struct Scsi_Host *shost = lpfc_shost_from_vport(vport);
#define ELS_CMD_RPL 0x57000000
#define ELS_CMD_FAN 0x60000000
#define ELS_CMD_RSCN 0x61040000
+#define ELS_CMD_RSCN_XMT 0x61040008
#define ELS_CMD_SCR 0x62000000
#define ELS_CMD_RNID 0x78000000
#define ELS_CMD_LIRR 0x7A000000
#define ELS_CMD_RPL 0x57
#define ELS_CMD_FAN 0x60
#define ELS_CMD_RSCN 0x0461
+#define ELS_CMD_RSCN_XMT 0x08000461
#define ELS_CMD_SCR 0x62
#define ELS_CMD_RNID 0x78
#define ELS_CMD_LIRR 0x7A
#endif
}
+/**
+ * lpfc_nvme_rescan_port - Check to see if we should rescan this remoteport
+ *
+ * If the ndlp represents an NVME Target, that we are logged into,
+ * ping the NVME FC Transport layer to initiate a device rescan
+ * on this remote NPort.
+ */
+void
+lpfc_nvme_rescan_port(struct lpfc_vport *vport, struct lpfc_nodelist *ndlp)
+{
+#if (IS_ENABLED(CONFIG_NVME_FC))
+ struct lpfc_nvme_rport *rport;
+ struct nvme_fc_remote_port *remoteport;
+
+ rport = ndlp->nrport;
+
+ lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME_DISC,
+ "6170 Rescan NPort DID x%06x type x%x "
+ "state x%x rport %p\n",
+ ndlp->nlp_DID, ndlp->nlp_type, ndlp->nlp_state, rport);
+ if (!rport)
+ goto input_err;
+ remoteport = rport->remoteport;
+ if (!remoteport)
+ goto input_err;
+
+ /* Only rescan if we are an NVME target in the MAPPED state */
+ if (remoteport->port_role & FC_PORT_ROLE_NVME_DISCOVERY &&
+ ndlp->nlp_state == NLP_STE_MAPPED_NODE) {
+ nvme_fc_rescan_remoteport(remoteport);
+
+ lpfc_printf_vlog(vport, KERN_ERR, LOG_NVME_DISC,
+ "6172 NVME rescanned DID x%06x "
+ "port_state x%x\n",
+ ndlp->nlp_DID, remoteport->port_state);
+ }
+ return;
+input_err:
+ lpfc_printf_vlog(vport, KERN_ERR, LOG_NVME_DISC,
+ "6169 State error: lport %p, rport%p FCID x%06x\n",
+ vport->localport, ndlp->rport, ndlp->nlp_DID);
+#endif
+}
+
/* lpfc_nvme_unregister_port - unbind the DID and port_role from this rport.
*
* There is no notion of Devloss or rport recovery from the current
spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
}
+static void
+lpfc_nvmet_discovery_event(struct nvmet_fc_target_port *tgtport)
+{
+ struct lpfc_nvmet_tgtport *tgtp;
+ struct lpfc_hba *phba;
+ uint32_t rc;
+
+ tgtp = tgtport->private;
+ phba = tgtp->phba;
+
+ rc = lpfc_issue_els_rscn(phba->pport, 0);
+ lpfc_printf_log(phba, KERN_ERR, LOG_NVME,
+ "6420 NVMET subsystem change: Notification %s\n",
+ (rc) ? "Failed" : "Sent");
+}
+
static struct nvmet_fc_target_template lpfc_tgttemplate = {
.targetport_delete = lpfc_nvmet_targetport_delete,
.xmt_ls_rsp = lpfc_nvmet_xmt_ls_rsp,
.fcp_abort = lpfc_nvmet_xmt_fcp_abort,
.fcp_req_release = lpfc_nvmet_xmt_fcp_release,
.defer_rcv = lpfc_nvmet_defer_rcv,
+ .discovery_event = lpfc_nvmet_discovery_event,
.max_hw_queues = 1,
.max_sgl_segments = LPFC_NVMET_DEFAULT_SEGS,
if (if_type >= LPFC_SLI_INTF_IF_TYPE_2) {
if (pcmd && (*pcmd == ELS_CMD_FLOGI ||
*pcmd == ELS_CMD_SCR ||
+ *pcmd == ELS_CMD_RSCN_XMT ||
*pcmd == ELS_CMD_FDISC ||
*pcmd == ELS_CMD_LOGO ||
*pcmd == ELS_CMD_PLOGI)) {
{
struct file *file = iocb->ki_filp;
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
- struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
+ struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
loff_t pos = iocb->ki_pos;
bool should_dirty = false;
struct bio bio;
ssize_t ret;
blk_qc_t qc;
- struct bvec_iter_all iter_all;
if ((pos | iov_iter_alignment(iter)) &
(bdev_logical_block_size(bdev) - 1))
}
__set_current_state(TASK_RUNNING);
- bio_for_each_segment_all(bvec, &bio, iter_all) {
- if (should_dirty && !PageCompound(bvec->bv_page))
- set_page_dirty_lock(bvec->bv_page);
- if (!bio_flagged(&bio, BIO_NO_PAGE_REF))
- put_page(bvec->bv_page);
- }
-
+ bio_release_pages(&bio, should_dirty);
if (unlikely(bio.bi_status))
ret = blk_status_to_errno(bio.bi_status);
if (should_dirty) {
bio_check_pages_dirty(bio);
} else {
- if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
- struct bvec_iter_all iter_all;
- struct bio_vec *bvec;
-
- bio_for_each_segment_all(bvec, bio, iter_all)
- put_page(bvec->bv_page);
- }
+ bio_release_pages(bio, false);
bio_put(bio);
}
}
*/
static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
{
- struct bio_vec *bvec;
blk_status_t err = bio->bi_status;
+ bool should_dirty = dio->op == REQ_OP_READ && dio->should_dirty;
if (err) {
if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
dio->io_error = -EIO;
}
- if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
+ if (dio->is_async && should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
- struct bvec_iter_all iter_all;
-
- bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
-
- if (dio->op == REQ_OP_READ && !PageCompound(page) &&
- dio->should_dirty)
- set_page_dirty_lock(page);
- put_page(page);
- }
+ bio_release_pages(bio, should_dirty);
bio_put(bio);
}
return err;
void wbc_account_io(struct writeback_control *wbc, struct page *page,
size_t bytes)
{
+ struct cgroup_subsys_state *css;
int id;
/*
if (!wbc->wb)
return;
- id = mem_cgroup_css_from_page(page)->id;
+ css = mem_cgroup_css_from_page(page);
+ /* dead cgroups shouldn't contribute to inode ownership arbitration */
+ if (!(css->flags & CSS_ONLINE))
+ return;
+
+ id = css->id;
if (id == wbc->wb_id) {
wbc->wb_bytes += bytes;
iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
if (offset)
iov_iter_advance(iter, offset);
-
- /* don't drop a reference to these pages */
- iter->type |= ITER_BVEC_FLAG_NO_REF;
return 0;
}
if (iop)
atomic_inc(&iop->read_count);
- if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
+ if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (should_dirty) {
bio_check_pages_dirty(bio);
} else {
- if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
- struct bvec_iter_all iter_all;
- struct bio_vec *bvec;
-
- bio_for_each_segment_all(bvec, bio, iter_all)
- put_page(bvec->bv_page);
- }
+ bio_release_pages(bio, false);
bio_put(bio);
}
}
atomic_inc(&iop->write_count);
if (!merged) {
- if (bio_full(wpc->ioend->io_bio))
+ if (bio_full(wpc->ioend->io_bio, len))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
return NULL;
}
-static inline bool bio_full(struct bio *bio)
+/**
+ * bio_full - check if the bio is full
+ * @bio: bio to check
+ * @len: length of one segment to be added
+ *
+ * Return true if @bio is full and one segment with @len bytes can't be
+ * added to the bio, otherwise return false
+ */
+static inline bool bio_full(struct bio *bio, unsigned len)
{
- return bio->bi_vcnt >= bio->bi_max_vecs;
+ if (bio->bi_vcnt >= bio->bi_max_vecs)
+ return true;
+
+ if (bio->bi_iter.bi_size > UINT_MAX - len)
+ return true;
+
+ return false;
}
static inline bool bio_next_segment(const struct bio *bio,
}
struct request_queue;
-extern int bio_phys_segments(struct request_queue *, struct bio *);
extern int submit_bio_wait(struct bio *bio);
extern void bio_advance(struct bio *, unsigned);
void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+void bio_release_pages(struct bio *bio, bool mark_dirty);
struct rq_map_data;
extern struct bio *bio_map_user_iov(struct request_queue *,
struct iov_iter *, gfp_t);
struct hd_struct *part,
unsigned long start_time);
-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
-#endif
-#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-extern void bio_flush_dcache_pages(struct bio *bi);
-#else
-static inline void bio_flush_dcache_pages(struct bio *bi)
-{
-}
-#endif
-
extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
struct bio *src, struct bvec_iter *src_iter);
extern void bio_copy_data(struct bio *dst, struct bio *src);
/*
* blkg_[rw]stat->aux_cnt is excluded for local stats but included for
- * recursive. Used to carry stats of dead children, and, for blkg_rwstat,
- * to carry result values from read and sum operations.
+ * recursive. Used to carry stats of dead children.
*/
-struct blkg_stat {
- struct percpu_counter cpu_cnt;
- atomic64_t aux_cnt;
-};
-
struct blkg_rwstat {
struct percpu_counter cpu_cnt[BLKG_RWSTAT_NR];
atomic64_t aux_cnt[BLKG_RWSTAT_NR];
};
+struct blkg_rwstat_sample {
+ u64 cnt[BLKG_RWSTAT_NR];
+};
+
/*
* A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
* request_queue (q). This is used by blkcg policies which need to track
void blkcg_deactivate_policy(struct request_queue *q,
const struct blkcg_policy *pol);
+static inline u64 blkg_rwstat_read_counter(struct blkg_rwstat *rwstat,
+ unsigned int idx)
+{
+ return atomic64_read(&rwstat->aux_cnt[idx]) +
+ percpu_counter_sum_positive(&rwstat->cpu_cnt[idx]);
+}
+
const char *blkg_dev_name(struct blkcg_gq *blkg);
void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
u64 (*prfill)(struct seq_file *,
bool show_total);
u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v);
u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
- const struct blkg_rwstat *rwstat);
-u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
+ const struct blkg_rwstat_sample *rwstat);
u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
int off);
int blkg_print_stat_bytes(struct seq_file *sf, void *v);
int blkg_print_stat_bytes_recursive(struct seq_file *sf, void *v);
int blkg_print_stat_ios_recursive(struct seq_file *sf, void *v);
-u64 blkg_stat_recursive_sum(struct blkcg_gq *blkg,
- struct blkcg_policy *pol, int off);
-struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
- struct blkcg_policy *pol, int off);
+void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol,
+ int off, struct blkg_rwstat_sample *sum);
struct blkg_conf_ctx {
struct gendisk *disk;
if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
(p_blkg)->q, false)))
-static inline int blkg_stat_init(struct blkg_stat *stat, gfp_t gfp)
-{
- int ret;
-
- ret = percpu_counter_init(&stat->cpu_cnt, 0, gfp);
- if (ret)
- return ret;
-
- atomic64_set(&stat->aux_cnt, 0);
- return 0;
-}
-
-static inline void blkg_stat_exit(struct blkg_stat *stat)
-{
- percpu_counter_destroy(&stat->cpu_cnt);
-}
-
-/**
- * blkg_stat_add - add a value to a blkg_stat
- * @stat: target blkg_stat
- * @val: value to add
- *
- * Add @val to @stat. The caller must ensure that IRQ on the same CPU
- * don't re-enter this function for the same counter.
- */
-static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
-{
- percpu_counter_add_batch(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
-}
-
-/**
- * blkg_stat_read - read the current value of a blkg_stat
- * @stat: blkg_stat to read
- */
-static inline uint64_t blkg_stat_read(struct blkg_stat *stat)
-{
- return percpu_counter_sum_positive(&stat->cpu_cnt);
-}
-
-/**
- * blkg_stat_reset - reset a blkg_stat
- * @stat: blkg_stat to reset
- */
-static inline void blkg_stat_reset(struct blkg_stat *stat)
-{
- percpu_counter_set(&stat->cpu_cnt, 0);
- atomic64_set(&stat->aux_cnt, 0);
-}
-
-/**
- * blkg_stat_add_aux - add a blkg_stat into another's aux count
- * @to: the destination blkg_stat
- * @from: the source
- *
- * Add @from's count including the aux one to @to's aux count.
- */
-static inline void blkg_stat_add_aux(struct blkg_stat *to,
- struct blkg_stat *from)
-{
- atomic64_add(blkg_stat_read(from) + atomic64_read(&from->aux_cnt),
- &to->aux_cnt);
-}
-
static inline int blkg_rwstat_init(struct blkg_rwstat *rwstat, gfp_t gfp)
{
int i, ret;
*
* Read the current snapshot of @rwstat and return it in the aux counts.
*/
-static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
+static inline void blkg_rwstat_read(struct blkg_rwstat *rwstat,
+ struct blkg_rwstat_sample *result)
{
- struct blkg_rwstat result;
int i;
for (i = 0; i < BLKG_RWSTAT_NR; i++)
- atomic64_set(&result.aux_cnt[i],
- percpu_counter_sum_positive(&rwstat->cpu_cnt[i]));
- return result;
+ result->cnt[i] =
+ percpu_counter_sum_positive(&rwstat->cpu_cnt[i]);
}
/**
*/
static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
{
- struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
+ struct blkg_rwstat_sample tmp = { };
- return atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
- atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
+ blkg_rwstat_read(rwstat, &tmp);
+ return tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE];
}
/**
bool blk_mq_complete_request(struct request *rq);
void blk_mq_complete_request_sync(struct request *rq);
bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
- struct bio *bio);
+ struct bio *bio, unsigned int nr_segs);
bool blk_mq_queue_stopped(struct request_queue *q);
void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
blk_status_t bi_status;
u8 bi_partno;
- /* Number of segments in this BIO after
- * physical address coalescing is performed.
- */
- unsigned int bi_phys_segments;
-
struct bvec_iter bi_iter;
atomic_t __bi_remaining;
*/
enum {
BIO_NO_PAGE_REF, /* don't put release vec pages */
- BIO_SEG_VALID, /* bi_phys_segments valid */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_USER_MAPPED, /* contains user pages */
unsigned int cmd_flags; /* op and common flags */
req_flags_t rq_flags;
+ int tag;
int internal_tag;
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
- int tag;
sector_t __sector; /* sector cursor */
struct bio *bio;
extern blk_qc_t generic_make_request(struct bio *bio);
extern blk_qc_t direct_make_request(struct bio *bio);
extern void blk_rq_init(struct request_queue *q, struct request *rq);
-extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
extern void blk_put_request(struct request *);
extern struct request *blk_get_request(struct request_queue *, unsigned int op,
blk_mq_req_flags_t flags);
struct request *rq);
extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
extern void blk_queue_split(struct request_queue *, struct bio **);
-extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
unsigned int, void __user *);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
struct request *, int, rq_end_io_fn *);
+/* Helper to convert REQ_OP_XXX to its string format XXX */
+extern const char *blk_op_str(unsigned int op);
+
int blk_status_to_errno(blk_status_t status);
blk_status_t errno_to_blk_status(int errno);
*
* blk_update_request() completes given number of bytes and updates
* the request without completing it.
- *
- * blk_end_request() and friends. __blk_end_request() must be called
- * with the request queue spinlock acquired.
- *
- * Several drivers define their own end_request and call
- * blk_end_request() for parts of the original function.
- * This prevents code duplication in drivers.
*/
extern bool blk_update_request(struct request *rq, blk_status_t error,
unsigned int nr_bytes);
-extern void blk_end_request_all(struct request *rq, blk_status_t error);
-extern bool __blk_end_request(struct request *rq, blk_status_t error,
- unsigned int nr_bytes);
-extern void __blk_end_request_all(struct request *rq, blk_status_t error);
-extern bool __blk_end_request_cur(struct request *rq, blk_status_t error);
extern void __blk_complete_request(struct request *);
extern void blk_abort_request(struct request *);
void (*depth_updated)(struct blk_mq_hw_ctx *);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
- bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+ bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *, unsigned int);
int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
void (*requests_merged)(struct request_queue *, struct request *, struct request *);
* nvmefc_tgt_fcp_req.
* Entrypoint is Optional.
*
+ * @discovery_event: Called by the transport to generate an RSCN
+ * change notifications to NVME initiators. The RSCN notifications
+ * should cause the initiator to rescan the discovery controller
+ * on the targetport.
+ *
* @max_hw_queues: indicates the maximum number of hw queues the LLDD
* supports for cpu affinitization.
* Value is Mandatory. Must be at least 1.
struct nvmefc_tgt_fcp_req *fcpreq);
void (*defer_rcv)(struct nvmet_fc_target_port *tgtport,
struct nvmefc_tgt_fcp_req *fcpreq);
+ void (*discovery_event)(struct nvmet_fc_target_port *tgtport);
u32 max_hw_queues;
u16 max_sgl_segments;
nvme_cmd_resv_release = 0x15,
};
+#define nvme_opcode_name(opcode) { opcode, #opcode }
+#define show_nvm_opcode_name(val) \
+ __print_symbolic(val, \
+ nvme_opcode_name(nvme_cmd_flush), \
+ nvme_opcode_name(nvme_cmd_write), \
+ nvme_opcode_name(nvme_cmd_read), \
+ nvme_opcode_name(nvme_cmd_write_uncor), \
+ nvme_opcode_name(nvme_cmd_compare), \
+ nvme_opcode_name(nvme_cmd_write_zeroes), \
+ nvme_opcode_name(nvme_cmd_dsm), \
+ nvme_opcode_name(nvme_cmd_resv_register), \
+ nvme_opcode_name(nvme_cmd_resv_report), \
+ nvme_opcode_name(nvme_cmd_resv_acquire), \
+ nvme_opcode_name(nvme_cmd_resv_release))
+
+
/*
* Descriptor subtype - lower 4 bits of nvme_(keyed_)sgl_desc identifier
*
nvme_admin_sanitize_nvm = 0x84,
};
+#define nvme_admin_opcode_name(opcode) { opcode, #opcode }
+#define show_admin_opcode_name(val) \
+ __print_symbolic(val, \
+ nvme_admin_opcode_name(nvme_admin_delete_sq), \
+ nvme_admin_opcode_name(nvme_admin_create_sq), \
+ nvme_admin_opcode_name(nvme_admin_get_log_page), \
+ nvme_admin_opcode_name(nvme_admin_delete_cq), \
+ nvme_admin_opcode_name(nvme_admin_create_cq), \
+ nvme_admin_opcode_name(nvme_admin_identify), \
+ nvme_admin_opcode_name(nvme_admin_abort_cmd), \
+ nvme_admin_opcode_name(nvme_admin_set_features), \
+ nvme_admin_opcode_name(nvme_admin_get_features), \
+ nvme_admin_opcode_name(nvme_admin_async_event), \
+ nvme_admin_opcode_name(nvme_admin_ns_mgmt), \
+ nvme_admin_opcode_name(nvme_admin_activate_fw), \
+ nvme_admin_opcode_name(nvme_admin_download_fw), \
+ nvme_admin_opcode_name(nvme_admin_ns_attach), \
+ nvme_admin_opcode_name(nvme_admin_keep_alive), \
+ nvme_admin_opcode_name(nvme_admin_directive_send), \
+ nvme_admin_opcode_name(nvme_admin_directive_recv), \
+ nvme_admin_opcode_name(nvme_admin_dbbuf), \
+ nvme_admin_opcode_name(nvme_admin_format_nvm), \
+ nvme_admin_opcode_name(nvme_admin_security_send), \
+ nvme_admin_opcode_name(nvme_admin_security_recv), \
+ nvme_admin_opcode_name(nvme_admin_sanitize_nvm))
+
enum {
NVME_QUEUE_PHYS_CONTIG = (1 << 0),
NVME_CQ_IRQ_ENABLED = (1 << 1),
nvme_fabrics_type_property_get = 0x04,
};
+#define nvme_fabrics_type_name(type) { type, #type }
+#define show_fabrics_type_name(type) \
+ __print_symbolic(type, \
+ nvme_fabrics_type_name(nvme_fabrics_type_property_set), \
+ nvme_fabrics_type_name(nvme_fabrics_type_connect), \
+ nvme_fabrics_type_name(nvme_fabrics_type_property_get))
+
+/*
+ * If not fabrics command, fctype will be ignored.
+ */
+#define show_opcode_name(qid, opcode, fctype) \
+ ((opcode) == nvme_fabrics_command ? \
+ show_fabrics_type_name(fctype) : \
+ ((qid) ? \
+ show_nvm_opcode_name(opcode) : \
+ show_admin_opcode_name(opcode)))
+
struct nvmf_common_command {
__u8 opcode;
__u8 resv1;
};
};
+static inline bool nvme_is_fabrics(struct nvme_command *cmd)
+{
+ return cmd->common.opcode == nvme_fabrics_command;
+}
+
struct nvme_error_slot {
__le64 error_count;
__le16 sqid;
*
* Why can't we simply have a Fabrics In and Fabrics out command?
*/
- if (unlikely(cmd->common.opcode == nvme_fabrics_command))
+ if (unlikely(nvme_is_fabrics(cmd)))
return cmd->fabrics.fctype & 1;
return cmd->common.opcode & 1;
}
case IOC_OPAL_ENABLE_DISABLE_MBR:
case IOC_OPAL_ERASE_LR:
case IOC_OPAL_SECURE_ERASE_LR:
+ case IOC_OPAL_PSID_REVERT_TPR:
+ case IOC_OPAL_MBR_DONE:
+ case IOC_OPAL_WRITE_SHADOW_MBR:
return true;
}
return false;
};
enum iter_type {
- /* set if ITER_BVEC doesn't hold a bv_page ref */
- ITER_BVEC_FLAG_NO_REF = 2,
-
/* iter types */
ITER_IOVEC = 4,
ITER_KVEC = 8,
static inline enum iter_type iov_iter_type(const struct iov_iter *i)
{
- return i->type & ~(READ | WRITE | ITER_BVEC_FLAG_NO_REF);
+ return i->type & ~(READ | WRITE);
}
static inline bool iter_is_iovec(const struct iov_iter *i)
return i->type & (READ | WRITE);
}
-static inline bool iov_iter_bvec_no_ref(const struct iov_iter *i)
-{
- return (i->type & ITER_BVEC_FLAG_NO_REF) != 0;
-}
-
/*
* Total number of bytes covered by an iovec.
*
#define show_bio_type(op,op_flags) show_bio_op(op), \
show_bio_op_flags(op_flags)
-#define show_bio_op(op) \
- __print_symbolic(op, \
- { REQ_OP_READ, "READ" }, \
- { REQ_OP_WRITE, "WRITE" }, \
- { REQ_OP_FLUSH, "FLUSH" }, \
- { REQ_OP_DISCARD, "DISCARD" }, \
- { REQ_OP_SECURE_ERASE, "SECURE_ERASE" }, \
- { REQ_OP_ZONE_RESET, "ZONE_RESET" }, \
- { REQ_OP_WRITE_SAME, "WRITE_SAME" }, \
- { REQ_OP_WRITE_ZEROES, "WRITE_ZEROES" })
+#define show_bio_op(op) blk_op_str(op)
#define show_bio_op_flags(flags) \
__print_flags(F2FS_BIO_FLAG_MASK(flags), "|", \
OPAL_MBR_DISABLE = 0x01,
};
+enum opal_mbr_done_flag {
+ OPAL_MBR_NOT_DONE = 0x0,
+ OPAL_MBR_DONE = 0x01
+};
+
enum opal_user {
OPAL_ADMIN1 = 0x0,
OPAL_USER1 = 0x01,
__u8 __align[7];
};
+struct opal_mbr_done {
+ struct opal_key key;
+ __u8 done_flag;
+ __u8 __align[7];
+};
+
+struct opal_shadow_mbr {
+ struct opal_key key;
+ const __u64 data;
+ __u64 offset;
+ __u64 size;
+};
+
#define IOC_OPAL_SAVE _IOW('p', 220, struct opal_lock_unlock)
#define IOC_OPAL_LOCK_UNLOCK _IOW('p', 221, struct opal_lock_unlock)
#define IOC_OPAL_TAKE_OWNERSHIP _IOW('p', 222, struct opal_key)
#define IOC_OPAL_ENABLE_DISABLE_MBR _IOW('p', 229, struct opal_mbr_data)
#define IOC_OPAL_ERASE_LR _IOW('p', 230, struct opal_session_info)
#define IOC_OPAL_SECURE_ERASE_LR _IOW('p', 231, struct opal_session_info)
+#define IOC_OPAL_PSID_REVERT_TPR _IOW('p', 232, struct opal_key)
+#define IOC_OPAL_MBR_DONE _IOW('p', 233, struct opal_mbr_done)
+#define IOC_OPAL_WRITE_SHADOW_MBR _IOW('p', 234, struct opal_shadow_mbr)
#endif /* _UAPI_SED_OPAL_H */
See Documentation/cgroup-v1/blkio-controller.rst for more information.
-config DEBUG_BLK_CGROUP
- bool "IO controller debugging"
- depends on BLK_CGROUP
- default n
- ---help---
- Enable some debugging help. Currently it exports additional stat
- files in a cgroup which can be useful for debugging.
-
config CGROUP_WRITEBACK
bool
depends on MEMCG && BLK_CGROUP
return NULL;
}
+EXPORT_SYMBOL_GPL(css_next_descendant_pre);
/**
* css_rightmost_descendant - return the rightmost descendant of a css
/*
* First get a stable cleared mask, setting the old mask to 0.
*/
- do {
- mask = sb->map[index].cleared;
- } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+ mask = xchg(&sb->map[index].cleared, 0);
/*
* Now clear the masked bits in our free word
struct sbq_wait_state *ws = &sbq->ws[wake_index];
if (waitqueue_active(&ws->wait)) {
- int o = atomic_read(&sbq->wake_index);
-
- if (wake_index != o)
- atomic_cmpxchg(&sbq->wake_index, o, wake_index);
+ if (wake_index != atomic_read(&sbq->wake_index))
+ atomic_set(&sbq->wake_index, wake_index);
return ws;
}