src/mutex/README

   1 # $Id$
   2
   3 Note: this only applies to locking using test-and-set and fcntl calls,
   4 pthreads were added after this was written.
   5
   6 Resource locking routines: lock based on a DB_MUTEX.  All this gunk
   7 (including trying to make assembly code portable), is necessary because
   8 System V semaphores require system calls for uncontested locks and we
   9 don't want to make two system calls per resource lock.
  10
  11 First, this is how it works.  The DB_MUTEX structure contains a resource
  12 test-and-set lock (tsl), a file offset, a pid for debugging and statistics
  13 information.
  14
  15 If HAVE_MUTEX_FCNTL is NOT defined (that is, we know how to do
  16 test-and-sets for this compiler/architecture combination), we try and
  17 lock the resource tsl some number of times (based on the number of
  18 processors).  If we can't acquire the mutex that way, we use a system
  19 call to sleep for 1ms, 2ms, 4ms, etc.  (The time is bounded at 10ms for
  20 mutexes backing logical locks and 25 ms for data structures, just in
  21 case.)  Using the timer backoff means that there are two assumptions:
  22 that mutexes are held for brief periods (never over system calls or I/O)
  23 and mutexes are not hotly contested.
  24
  25 If HAVE_MUTEX_FCNTL is defined, we use a file descriptor to do byte
  26 locking on a file at a specified offset.  In this case, ALL of the
  27 locking is done in the kernel.  Because file descriptors are allocated
  28 per process, we have to provide the file descriptor as part of the lock
  29 call.  We still have to do timer backoff because we need to be able to
  30 block ourselves, that is, the lock manager causes processes to wait by
  31 having the process acquire a mutex and then attempting to re-acquire the
  32 mutex.  There's no way to use kernel locking to block yourself, that is,
  33 if you hold a lock and attempt to re-acquire it, the attempt will
  34 succeed.
  35
  36 Next, let's talk about why it doesn't work the way a reasonable person
  37 would think it should work.
  38
  39 Ideally, we'd have the ability to try to lock the resource tsl, and if
  40 that fails, increment a counter of waiting processes, then block in the
  41 kernel until the tsl is released.  The process holding the resource tsl
  42 would see the wait counter when it went to release the resource tsl, and
  43 would wake any waiting processes up after releasing the lock.  This would
  44 actually require both another tsl (call it the mutex tsl) and
  45 synchronization between the call that blocks in the kernel and the actual
  46 resource tsl.  The mutex tsl would be used to protect accesses to the
  47 DB_MUTEX itself.  Locking the mutex tsl would be done by a busy loop,
  48 which is safe because processes would never block holding that tsl (all
  49 they would do is try to obtain the resource tsl and set/check the wait
  50 count).  The problem in this model is that the blocking call into the
  51 kernel requires a blocking semaphore, i.e. one whose normal state is
  52 locked.
  53
  54 The only portable forms of locking under UNIX are fcntl(2) on a file
  55 descriptor/offset, and System V semaphores.  Neither of these locking
  56 methods are sufficient to solve the problem.
  57
  58 The problem with fcntl locking is that only the process that obtained the
  59 lock can release it.  Remember, we want the normal state of the kernel
  60 semaphore to be locked.  So, if the creator of the DB_MUTEX were to
  61 initialize the lock to "locked", then a second process locks the resource
  62 tsl, and then a third process needs to block, waiting for the resource
  63 tsl, when the second process wants to wake up the third process, it can't
  64 because it's not the holder of the lock!  For the second process to be
  65 the holder of the lock, we would have to make a system call per
  66 uncontested lock, which is what we were trying to get away from in the
  67 first place.
  68
  69 There are some hybrid schemes, such as signaling the holder of the lock,
  70 or using a different blocking offset depending on which process is
  71 holding the lock, but it gets complicated fairly quickly.  I'm open to
  72 suggestions, but I'm not holding my breath.
  73
  74 Regardless, we use this form of locking when we don't have any other
  75 choice, because it doesn't have the limitations found in System V
  76 semaphores, and because the normal state of the kernel object in that
  77 case is unlocked, so the process releasing the lock is also the holder
  78 of the lock.
  79
  80 The System V semaphore design has a number of other limitations that make
  81 it inappropriate for this task.  Namely:
  82
  83 First, the semaphore key name space is separate from the file system name
  84 space (although there exist methods for using file names to create
  85 semaphore keys).  If we use a well-known key, there's no reason to believe
  86 that any particular key will not already be in use, either by another
  87 instance of the DB application or some other application, in which case
  88 the DB application will fail.  If we create a key, then we have to use a
  89 file system name to rendezvous and pass around the key.
  90
  91 Second, System V semaphores traditionally have compile-time, system-wide
  92 limits on the number of semaphore keys that you can have.  Typically, that
  93 number is far too low for any practical purpose.  Since the semaphores
  94 permit more than a single slot per semaphore key, we could try and get
  95 around that limit by using multiple slots, but that means that the file
  96 that we're using for rendezvous is going to have to contain slot
  97 information as well as semaphore key information, and we're going to be
  98 reading/writing it on every db_mutex_t init or destroy operation.  Anyhow,
  99 similar compile-time, system-wide limits on the numbers of slots per
 100 semaphore key kick in, and you're right back where you started.
 101
 102 My fantasy is that once POSIX.1 standard mutexes are in wide-spread use,
 103 we can switch to them.  My guess is that it won't happen, because the
 104 POSIX semaphores are only required to work for threads within a process,
 105 and not independent processes.
 106
 107 Note: there are races in the statistics code, but since it's just that,
 108 I didn't bother fixing them.  (The fix requires a mutex tsl, so, when/if
 109 this code is fixed to do rational locking (see above), then change the
 110 statistics update code to acquire/release the mutex tsl.