doc/lvm_fault_handling.txt

LVM device fault handling
=========================

Introduction
------------
This document serves as the definitive source for information
regarding the policies and procedures surrounding device failures
in LVM.  It codifies LVM's responses to device failures as well as
the responsibilities of administrators.

Device failures can be permanent or transient.  A permanent failure
is one where a device becomes inaccessible and will never be
revived.  A transient failure is a failure that can be recovered
from (e.g. a power failure, intermittent network outage, block
relocation, etc.).  The policies for handling both types of failures
are described herein.

Available Operations During a Device Failure
--------------------------------------------
When there is a device failure, LVM behaves somewhat differently because
only a subset of the available devices will be found for the particular
volume group.  The number of operations available to the administrator
is diminished.  It is not possible to create new logical volumes while
PVs cannot be accessed, for example.  Operations that create, convert, or
resize logical volumes are disallowed, such as:
- lvcreate
- lvresize
- lvreduce
- lvextend
- lvconvert (unless '--repair' is used)
Operations that activate, deactivate, remove, report, or repair logical
volumes are allowed, such as:
- lvremove
- vgremove (will remove all LVs, but not the VG until consistent)
- pvs
- vgs
- lvs
- lvchange -a [yn]
- vgchange -a [yn]
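The two lists above can be read as a simple lookup.  The sketch below
is only an illustration of that reading, not LVM code: the function
name and its return convention are hypothetical, and because the lists
are examples ("such as"), any command they do not mention is reported
as unknown rather than guessed.

```python
# Hypothetical helper: classifies a command line according to the lists
# in this document.  It does not call LVM and is not part of LVM.

DISALLOWED = {"lvcreate", "lvresize", "lvreduce", "lvextend"}
ALLOWED = {"lvremove", "vgremove", "pvs", "vgs", "lvs"}


def allowed_during_failure(argv):
    """Return True or False for listed commands, None if unlisted."""
    command, args = argv[0], argv[1:]
    if command == "lvconvert":
        # Conversion is disallowed unless '--repair' is used.
        return "--repair" in args
    if command == "vgreduce" and "--removemissing" in args:
        return True
    if command in ("lvchange", "vgchange"):
        # Only activation and deactivation ('-a y' / '-a n') are listed.
        if {"-ay", "-an"} & set(args):
            return True
        for flag, value in zip(args, args[1:]):
            if flag == "-a" and value in ("y", "n"):
                return True
        return None
    if command in DISALLOWED:
        return False
    if command in ALLOWED:
        return True
    return None
```

For example, allowed_during_failure(["lvconvert", "--repair", "vg/lv"])
is True, while the same call without '--repair' is False.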
Operations specific to the handling of failed devices are allowed and
are as follows:

- 'vgreduce --removemissing <VG>':  This action is designed to remove
  the reference to a failed device from the LVM metadata stored on the
  remaining devices.  If there are (portions of) logical volumes on the
  failed devices, the ability of the operation to proceed will depend
  on the type of logical volumes found.  If an image (i.e. leg or side)
  of a mirror is located on the device, that image/leg of the mirror
  is eliminated along with the failed device.  The result of such a
  mirror reduction could be a no-longer-redundant linear device.  If
  a linear, stripe, or snapshot device is located on the failed device
  the command will not proceed without a '--force' option.  The result
  of using the '--force' option is the entire removal and complete
  loss of the non-redundant logical volume.  Once this operation is
  complete, the volume group will again have a complete and consistent
  view of the devices it contains.  Thus, all operations will be
  permitted - including creation, conversion, and resizing operations.

- 'lvconvert --repair <VG/LV>':  This action is designed specifically
  to operate on mirrored logical volumes.  It is used on logical volumes
  individually and does not remove the faulty device from the volume
  group.  If, for example, a failed device happened to contain the
  images of four distinct mirrors, it would be necessary to run
  'lvconvert --repair' on each of them.  The ultimate result is to leave
  the faulty device in the volume group, but have no logical volumes
  referencing it.  In addition to removing mirror images that reside
  on failed devices, 'lvconvert --repair' can also replace the failed
  device if there are spare devices available in the volume group.
  When run from the command line, the user is prompted whether to
  simply remove the failed portions of the mirror or to also allocate
  a replacement.  Optionally, the '--use-policies' flag can be
  specified, which will cause the operation not to prompt the user,
  but instead respect the policies outlined in the LVM configuration
  file - usually, /etc/lvm/lvm.conf.  Once this operation is complete,
  mirrored logical volumes will be consistent and I/O will be allowed
  to continue.  However, the volume group will still be inconsistent -
  due to the referenced-but-missing device/PV - and operations will
  still be restricted to the aforementioned actions until either the
  device is restored or 'vgreduce --removemissing' is run.
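The effect of 'vgreduce --removemissing' on different logical volume
types can be sketched as a small model.  This is an illustration under
stated assumptions, not LVM's implementation: the dictionary layout and
the function name are hypothetical, each mirror image is assumed to sit
on exactly one PV, and mirror logs are ignored.

```python
# Hypothetical model of the behaviour described above.  Each LV is a
# dict with a "type" and the list of PVs it occupies.  For a mirror,
# every entry in "pvs" stands for one image.

NON_REDUNDANT = ("linear", "stripe", "snapshot")


def removemissing(lvs, failed_pv, force=False):
    """Return the LV layout after removing failed_pv from the VG."""
    blocked = [name for name, lv in lvs.items()
               if lv["type"] in NON_REDUNDANT and failed_pv in lv["pvs"]]
    if blocked and not force:
        # The command will not proceed without '--force'.
        raise RuntimeError("refusing without --force: "
                           + ", ".join(sorted(blocked)))
    result = {}
    for name, lv in lvs.items():
        if name in blocked:
            continue  # with '--force' the non-redundant LV is lost
        pvs = [pv for pv in lv["pvs"] if pv != failed_pv]
        kind = lv["type"]
        if kind == "mirror" and len(pvs) == 1:
            kind = "linear"  # a reduced 2-way mirror is no longer redundant
        result[name] = {"type": kind, "pvs": pvs}
    return result
```

'lvconvert --repair' differs in that it acts on one mirror at a time
and leaves the failed PV in the volume group, so the restrictions on
other operations remain until the PV returns or is removed.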

Device Revival (transient failures):
------------------------------------
The above section describes what limitations a user can expect during
a device failure.  However, if the device returns after a period of
time, what to expect will depend on what has happened during the time
period when the device was failed.  If no automated actions (described
below) or user actions were necessary or performed, then no change in
operations or logical volume layout will occur.  However, if an
automated action or one of the aforementioned repair commands was
manually run, the returning device will be perceived as having stale
LVM metadata.  In this case, the user can expect to see a warning
concerning inconsistent metadata.  The metadata on the returning
device will be automatically replaced with the latest copy of the
LVM metadata - restoring consistency.  Note, while most LVM commands
will automatically update the metadata on a restored device, the
following possible exceptions exist:
- pvs (when it does not read/update VG metadata)
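The replacement of stale metadata can be pictured with a short sketch.
It assumes that staleness is judged by the sequence number ('seqno')
carried in LVM's text metadata, which is not stated in this document;
the function and data layout are hypothetical and only illustrate the
idea of overwriting older copies with the newest one.

```python
# Hypothetical model: each device holds a copy of the VG metadata, and
# the copy with the highest 'seqno' is taken to be the latest.


def refresh_stale_metadata(copies):
    """Overwrite stale copies in place; return the devices updated."""
    latest = max(copies.values(), key=lambda md: md["seqno"])
    stale = sorted(dev for dev, md in copies.items()
                   if md["seqno"] < latest["seqno"])
    for dev in stale:
        # A warning about inconsistent metadata would be expected here.
        copies[dev] = dict(latest)
    return stale
```

A device that was absent while a repair command ran would carry the
lower number and would be the one returned by this function.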

Automated Target Response to Failures:
--------------------------------------
The only LVM target type (i.e. "personality") that has an automated
response to failures is a mirrored logical volume.  The other target
types (linear, stripe, snapshot, etc.) will simply propagate the failure.
[A snapshot becomes invalid if its underlying device fails, but the
origin will remain valid - presuming the origin device has not failed.]
There are three types of errors that a mirror can suffer - read, write,
and resynchronization errors.  Each is described in depth below.

Mirror read failures:
If a mirror is 'in-sync' (i.e. all images have been initialized and
are identical), a read failure will only produce a warning.  Data is
simply pulled from one of the other images and the fault is recorded.
Sometimes - like in the case of bad block relocation - read errors can
be recovered from by the storage hardware.  Therefore, it is up to the
user to decide whether to reconfigure the mirror and remove the device
that caused the error.  Managing the composition of a mirror is done with
'lvconvert' and removing a device from a volume group can be done with
'vgreduce'.

If a mirror is not 'in-sync', a read failure will produce an I/O error.
This error will propagate all the way up to the applications above the
logical volume (e.g. the file system).  No automatic intervention will
take place in this case either.  It is up to the user to decide what
can be done/salvaged in this scenario.  If the user is confident that the
images of the mirror are the same (or they are willing to simply attempt
to retrieve whatever data they can), 'lvconvert' can be used to eliminate
the failed image and proceed.
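The two read-failure cases reduce to one decision, sketched below.  The
function name and return values are hypothetical, and the case of an
in-sync mirror with no other image left is not covered by this
document; the sketch treats it as an I/O error as an assumption.

```python
# Hypothetical model of the read-failure handling described above.


def handle_read_failure(in_sync, images, failed_image):
    """Return (outcome, image_to_read_from)."""
    if in_sync:
        others = [img for img in images if img != failed_image]
        if others:
            # Data is pulled from another image; only a warning results.
            return ("warning", others[0])
    # Not in-sync (or nothing left to read from): the error propagates
    # to the applications above the logical volume.
    return ("io_error", None)
```

In neither case is the mirror reconfigured automatically; removing the
image that caused the error remains the user's decision.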

Mirror resynchronization errors:
A resynchronization error is one that occurs when trying to initialize
all mirror images to be the same.  It can happen due to a failure to
read the primary image (the image considered to have the 'good' data), or
due to a failure to write the secondary images.  This type of failure
only produces a warning, and it is up to the user to take action in this
case.  If the error is transient, the user can simply reactivate the
mirrored logical volume to make another attempt at resynchronization.
If attempts to finish resynchronization fail, 'lvconvert' can be used to
remove the faulty device from the mirror.

TODO...
Some sort of response to this type of error could be automated.
Since this document is the definitive source for how to handle device
failures, the process should be defined here.  If the process is defined
but not implemented, it should be noted as such.  One idea might be to
make a single attempt to suspend/resume the mirror in an attempt to
redo the sync operation that failed.  On the other hand, if there is
a permanent failure, it may simply be best to wait for the user or the
automated response that is sure to follow from a write failure.
...TODO

Mirror write failures:
When a write error occurs on a mirror constituent device, an attempt
to handle the failure is automatically made.  This is done by calling
'lvconvert --repair --use-policies'.  The policies implied by this
command are set in the LVM configuration file.  They are:
- mirror_log_fault_policy:  This defines what action should be taken
  if the device containing the log fails.  The available options are
  "remove" and "allocate".  Either of these options will cause the
  faulty log device to be removed from the mirror.  The "allocate"
  policy will attempt the further action of trying to replace the
  failed disk log by using space that might be available in the
  volume group.  If the allocation fails (or the "remove" policy
  is specified), the mirror log will be maintained in memory.  Should
  the machine be rebooted or the logical volume deactivated, a
  complete resynchronization of the mirror will be necessary upon
  the following activation - such is the nature of a mirror with a
  'core' log.  The default policy for handling log failures is
  "allocate".  The service disruption incurred by replacing the failed
  log is negligible, while the benefits of having a persistent log are
  pronounced.
- mirror_image_fault_policy:  This defines what action should be taken
  if a device containing an image fails.  Again, the available options
  are "remove" and "allocate".  Both of these options will cause the
  faulty image device to be removed - adjusting the logical volume
  accordingly.  For example, if one image of a 2-way mirror fails, the
  mirror will be converted to a linear device.  If one image of a
  3-way mirror fails, the mirror will be converted to a 2-way mirror.
  The "allocate" policy takes the further action of trying to replace
  the failed image using space that is available in the volume group.
  Replacing a failed mirror image will incur the cost of
  resynchronizing - degrading the performance of the mirror.  The
  default policy for handling an image failure is "remove".  This
  allows the mirror to still function, but gives the administrator the
  choice of when to incur the extra performance costs of replacing
  the failed image.
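The outcomes of the two policies can be summarised in a short model.
It is a sketch only: the function names, the 'can_allocate' parameter
and the returned values are hypothetical, and it assumes that an image
"allocate" that cannot find space leaves the mirror in the same state
as "remove".

```python
# Hypothetical model of the policy outcomes described above.  It does
# not read lvm.conf or call LVM.

POLICIES = ("remove", "allocate")


def describe(images):
    """Name the layout for a given number of remaining images."""
    return "linear" if images == 1 else f"{images}-way mirror"


def image_fault(images, policy="remove", can_allocate=False):
    """Model one image failing in a mirror with 'images' images."""
    if policy not in POLICIES:
        raise ValueError("unknown policy: " + policy)
    remaining = images - 1  # the faulty image is always removed
    replaced = policy == "allocate" and can_allocate
    if replaced:
        remaining += 1  # a replacement image must be resynchronized
    return {"layout": describe(remaining), "resync_needed": replaced}


def log_fault(policy="allocate", can_allocate=False):
    """Return the log type in use after the log device fails."""
    if policy not in POLICIES:
        raise ValueError("unknown policy: " + policy)
    if policy == "allocate" and can_allocate:
        return "disk"
    return "core"  # kept in memory; full resync on next activation
```

With the defaults, image_fault(2) yields a linear device and
log_fault() falls back to a core log only when no space is available.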

TODO...
The appropriate time to take permanent corrective action on a mirror
should be driven by policy.  There should be a directive that takes
a time or percentage argument.  Something like the following:
- mirror_fault_policy_WHEN = "10sec"/"10%"
A time value would signal the amount of time to wait for transient
failures to resolve themselves.  The percentage value would signal the
amount a mirror could become out-of-sync before the faulty device is
removed.

A mirror cannot be used unless /some/ corrective action is taken,
however.  One option is to replace the failed mirror image with an
error target, forgo the use of 'handle_errors', and simply let the
out-of-sync regions accumulate and be tracked by the log.  Mirrors
that have more than 2 images would have to "stack" to perform the
tracking, as each failed image would have to be associated with a
log.  If the failure is transient, the device would replace the
error target that was holding its spot and the log that was tracking
the deltas would be used to quickly restore the portions that changed.

One unresolved issue with the above scheme is how to know which
regions of the mirror are out-of-sync when a problem occurs.  When
a write failure occurs in the kernel, the log will contain those
regions that are not in-sync.  If the log is a disk log, that log
could continue to be used to track differences.  However, if the
log was a core log - or if the log device failed at the same time
as an image device - there would be no way to determine which
regions are out-of-sync to begin with as we start to track the
deltas for the failed image.  I don't have a solution for this
problem other than to handle errors in this way only if conditions
are right.  These issues will have to be ironed out before
proceeding.  This could be another case where it is better to
handle failures in the kernel by allowing the kernel to store
updates in various metadata areas.
...TODO