md/raid6: Fix anomily when recovering a single device in RAID6.
authorNeilBrown <neilb@suse.com>
Mon, 3 Apr 2017 02:11:32 +0000 (12:11 +1000)
committerGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Thu, 22 Mar 2018 08:17:50 +0000 (09:17 +0100)
[ Upstream commit 7471fb77ce4dc4cb81291189947fcdf621a97987 ]

When recoverying a single missing/failed device in a RAID6,
those stripes where the Q block is on the missing device are
handled a bit differently.  In these cases it is easy to
check that the P block is correct, so we do.  This results
in the P block be destroy.  Consequently the P block needs
to be read a second time in order to compute Q.  This causes
lots of seeks and hurts performance.

It shouldn't be necessary to re-read P as it can be computed
from the DATA.  But we only compute blocks on missing
devices, since c337869d9501 ("md: do not compute parity
unless it is on a failed drive").

So relax the change made in that commit to allow computing
of the P block in a RAID6 which it is the only missing that
block.

This makes RAID6 recovery run much faster as the disk just
"before" the recovering device is no longer seeking
back-and-forth.

Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
drivers/md/raid5.c

index 475a7a1..4493be5 100644 (file)
@@ -3391,9 +3391,20 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
                BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
                BUG_ON(test_bit(R5_Wantread, &dev->flags));
                BUG_ON(sh->batch_head);
+
+               /*
+                * In the raid6 case if the only non-uptodate disk is P
+                * then we already trusted P to compute the other failed
+                * drives. It is safe to compute rather than re-read P.
+                * In other cases we only compute blocks from failed
+                * devices, otherwise check/repair might fail to detect
+                * a real inconsistency.
+                */
+
                if ((s->uptodate == disks - 1) &&
+                   ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
                    (s->failed && (disk_idx == s->failed_num[0] ||
-                                  disk_idx == s->failed_num[1]))) {
+                                  disk_idx == s->failed_num[1])))) {
                        /* have disk failed, and we're requested to fetch it;
                         * do compute it
                         */