There was a race condition:
Consider a SyncSource+Primary node and a SyncTarget+Secondary node,
with a resync dependency on some other device. After both nodes have
decided to do the resync, the other device finishes its resync process.
At that time the SyncSource has already sent the P_SYNC_UUID packet and
has already updated its peer disk state to Inconsistent.
The SyncTarget node is still waiting for the P_SYNC_UUID and sends a
state packet to report the resync dependency change. That packet still
carries a disk state of Outdated.
Impact:
If application writes come in on the Primary node during that time,
they do not get replicated, and the out-of-sync counter gets increased.
=> The completion of the resync is not detected on the Primary node.
=> The resync stalls.
Those blocks get resynced with the next resync, since they are
marked as out-of-sync in the bitmap.
In order to fix this, we filter out that wrong state change in the
sanitize_state() function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
break;
case C_WF_BITMAP_S:
case C_PAUSED_SYNC_S:
- ns.pdsk = D_OUTDATED;
+ ns.pdsk = os.pdsk > D_OUTDATED ? D_OUTDATED : os.pdsk;
break;
case C_SYNC_SOURCE:
ns.pdsk = D_INCONSISTENT;
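
For illustration, a minimal user-space sketch of the new clamping
behavior. The enum ordering mirrors enum drbd_disk_state from
include/linux/drbd.h; sanitize_pdsk() is a hypothetical stand-in for
the changed branch of sanitize_state(), not the actual kernel code:

    #include <stdio.h>

    /* Ordering mirrors enum drbd_disk_state in include/linux/drbd.h:
     * D_INCONSISTENT sorts below D_OUTDATED, which sorts below
     * D_CONSISTENT and D_UP_TO_DATE. */
    enum drbd_disk_state {
            D_DISKLESS,
            D_ATTACHING,
            D_FAILED,
            D_NEGOTIATING,
            D_INCONSISTENT,
            D_OUTDATED,
            D_UNKNOWN,
            D_CONSISTENT,
            D_UP_TO_DATE,
    };

    /* Hypothetical stand-in for the changed branch: only clamp the
     * peer disk state down to D_OUTDATED, never raise it. The old
     * code assigned D_OUTDATED unconditionally, which upgraded an
     * already-known D_INCONSISTENT peer back to D_OUTDATED. */
    static enum drbd_disk_state sanitize_pdsk(enum drbd_disk_state os_pdsk)
    {
            return os_pdsk > D_OUTDATED ? D_OUTDATED : os_pdsk;
    }

    int main(void)
    {
            /* Racy case: the SyncSource already set pdsk to
             * Inconsistent; the late Outdated report must not
             * raise it again. */
            printf("%d\n", sanitize_pdsk(D_INCONSISTENT)); /* 4, stays D_INCONSISTENT */

            /* Normal case: an UpToDate peer gets clamped down. */
            printf("%d\n", sanitize_pdsk(D_UP_TO_DATE));   /* 5, becomes D_OUTDATED */
            return 0;
    }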