MD: fix lock contention for flush bios
There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.
Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.
This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:
=================Without the patch============================
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 11165 167.595 5879.560
Close 107469 1.391 2231.094
LockX 384 0.003 0.019
Rename 5944 2.141 1856.001
ReadX 208121 0.003 0.074
WriteX 98259 1925.402 15204.895
Unlink 25198 13.264 3457.268
UnlockX 384 0.001 0.009
FIND_FIRST 47111 0.012 0.076
SET_FILE_INFORMATION 12966 0.007 0.065
QUERY_FILE_INFORMATION 27921 0.004 0.085
QUERY_PATH_INFORMATION 124650 0.005 5.766
QUERY_FS_INFORMATION 22519 0.003 0.053
NTCreateX 141086 4.291 2502.812
Throughput 3.7181 MB/sec (sync open) 128 clients 128 procs max_latency=15204.905 ms
=================With the patch============================
Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 4500 174.134 406.398
Close 48195 0.060 467.062
LockX 256 0.003 0.029
Rename 2324 0.026 0.360
ReadX 78846 0.004 0.504
WriteX 66832 562.775 1467.037
Unlink 5516 3.665 1141.740
UnlockX 256 0.002 0.019
FIND_FIRST 16428 0.015 0.313
SET_FILE_INFORMATION 6400 0.009 0.520
QUERY_FILE_INFORMATION 17865 0.003 0.089
QUERY_PATH_INFORMATION 47060 0.078 416.299
QUERY_FS_INFORMATION 7024 0.004 0.032
NTCreateX 55921 0.854 1141.452
Throughput 11.744 MB/sec (sync open) 128 clients 128 procs max_latency=1467.041 ms
The test is done on raid1 disk with two rotational disks
V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version
V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.
Fix a bug from v2. It should set flush_pending to 1 at first.
V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe. It is only safe to call rdev_dec_pending() on rdevs
that you previously called
atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.
Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>