1 .. SPDX-License-Identifier: GPL-2.0
10 3) Setting mount states
21 Consider the following situation:
23 A process wants to clone its own namespace, but still wants to access the CD
24 that got mounted recently. Shared subtree semantics provide the necessary
25 mechanism to accomplish the above.
27 It provides the necessary building blocks for features like per-user-namespace
28 and versioned filesystem.
33 Shared subtree provides four different flavors of mounts; struct vfsmount to be
2a) A shared mount can be replicated to as many mountpoints as needed, and all
the replicas continue to be exactly the same.
47 Let's say /mnt has a mount that is shared::
49 mount --make-shared /mnt
51 Note: mount(8) command now supports the --make-shared flag,
52 so the sample 'smount' program is no longer needed and has been
57 # mount --bind /mnt /tmp
59 The above command replicates the mount at /mnt to the mountpoint /tmp
60 and the contents of both the mounts remain identical.
70 Now let's say we mount a device at /tmp/a::
72 # mount /dev/sd0 /tmp/a
80 Note that the mount has propagated to the mount at /mnt as well.
82 And the same is true even when /dev/sd0 is mounted on /mnt/a. The
83 contents will be visible under /tmp/a too.
86 2b) A slave mount is like a shared mount except that mount and umount events
87 only propagate towards it.
All slave mounts have a master mount which is shared.
93 Let's say /mnt has a mount which is shared.
94 # mount --make-shared /mnt
96 Let's bind mount /mnt to /tmp
97 # mount --bind /mnt /tmp
99 the new mount at /tmp becomes a shared mount and it is a replica of
Now let's make the mount at /tmp a slave of /mnt
103 # mount --make-slave /tmp
105 let's mount /dev/sd0 on /mnt/a
106 # mount /dev/sd0 /mnt/a
114 Note the mount event has propagated to the mount at /tmp
116 However let's see what happens if we mount something on the mount at /tmp
118 # mount /dev/sd1 /tmp/b
125 Note how the mount event has not propagated to the mount at
129 2c) A private mount does not forward or receive propagation.
This is the mount we are familiar with. It's the default type.
2d) An unbindable mount is an unbindable private mount
Let's say we have a mount at /mnt and we make it unbindable::
138 # mount --make-unbindable /mnt
140 Let's try to bind mount this mount somewhere else::
142 # mount --bind /mnt /tmp
143 mount: wrong fs type, bad option, bad superblock on /mnt,
144 or too many mounted file systems
Binding an unbindable mount is an invalid operation.
149 3) Setting mount states
151 The mount command (util-linux package) can be used to set mount
154 mount --make-shared mountpoint
155 mount --make-slave mountpoint
156 mount --make-private mountpoint
157 mount --make-unbindable mountpoint
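The state that these commands leave behind can be inspected without special
privileges. The following is a minimal sketch, assuming a Linux system whose
/proc/self/mountinfo follows the documented layout, in which the optional
fields between the sixth field and the lone "-" separator carry tags such as
shared:N, master:N and unbindable. The helper name propagation_of is made up
for this illustration.

```shell
# propagation_of: read mountinfo-format lines on stdin and print
# "<mount point> <propagation tags>" for each of them.
# A mount with no optional fields is reported as "private".
propagation_of() {
    awk '{
        tags = ""
        sep = ""
        # Optional fields start at field 7 and end at the lone "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            tags = tags sep $i
            sep = ","
        }
        if (tags == "")
            tags = "private"
        # Field 5 is the mount point relative to the process root.
        print $5, tags
    }'
}

# Typical use on a live system:
#   propagation_of < /proc/self/mountinfo
```

A mount that is both shared and slave would show up with two tags, for
example shared:2,master:1.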
163 A) A process wants to clone its own namespace, but still wants to
164 access the CD that got mounted recently.
168 The system administrator can make the mount at /cdrom shared::
170 mount --bind /cdrom /cdrom
171 mount --make-shared /cdrom
173 Now any process that clones off a new namespace will have a
174 mount at /cdrom which is a replica of the same mount in the
177 So when a CD is inserted and mounted at /cdrom that mount gets
178 propagated to the other mount at /cdrom in all the other clone
181 B) A process wants its mounts invisible to any other process, but
182 still be able to see the other system mounts.
186 To begin with, the administrator can mark the entire mount tree
189 mount --make-rshared /
191 A new process can clone off a new namespace. And mark some part
192 of its namespace as slave::
194 mount --make-rslave /myprivatetree
Henceforth, any mounts within /myprivatetree done by the
process will not show up in any other namespace. However, mounts
done in the parent namespace under /myprivatetree still show
up in the process's namespace.
202 Apart from the above semantics this feature provides the
203 building blocks to solve the following problems:
205 C) Per-user namespace
The above semantics allow mounts to be shared across
208 namespaces. But namespaces are associated with processes. If
209 namespaces are made first class objects with user API to
210 associate/disassociate a namespace with userid, then each user
211 could have his/her own namespace and tailor it to his/her
212 requirements. This needs to be supported in PAM.
216 If the entire mount tree is visible at multiple locations, then
217 an underlying versioning file system can return different
218 versions of the file depending on the path used to access that
223 mount --make-shared /
224 mount --rbind / /view/v1
225 mount --rbind / /view/v2
226 mount --rbind / /view/v3
227 mount --rbind / /view/v4
229 and if /usr has a versioning filesystem mounted, then that
230 mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
A user can request the v3 version of the file /usr/fs/namespace.c
by accessing /view/v3/usr/fs/namespace.c. The underlying
235 versioning filesystem can then decipher that v3 version of the
236 filesystem is being requested and return the corresponding
239 5) Detailed semantics
240 ---------------------
241 The section below explains the detailed semantics of
242 bind, rbind, move, mount, umount and clone-namespace operations.
244 Note: the word 'vfsmount' and the noun 'mount' have been used
245 to mean the same thing, throughout this document.
249 A given mount can be in one of the following states
A 'propagation event' is defined as an event generated on a vfsmount
258 that leads to mount or unmount actions in other vfsmounts.
260 A 'peer group' is defined as a group of vfsmounts that propagate
261 events to each other.
265 A 'shared mount' is defined as a vfsmount that belongs to a
270 mount --make-shared /mnt
271 mount --bind /mnt /tmp
273 The mount at /mnt and that at /tmp are both shared and belong
to the same peer group. Anything mounted or unmounted under
/mnt or /tmp is reflected in all the other mounts of its peer
281 A 'slave mount' is defined as a vfsmount that receives
282 propagation events and does not forward propagation events.
284 A slave mount as the name implies has a master mount from which
285 mount/unmount events are received. Events do not propagate from
286 the slave mount to the master. Only a shared mount can be made
287 a slave by executing the following command::
289 mount --make-slave mount
A shared mount that is made a slave is no longer shared unless
it is explicitly made shared again.
296 A vfsmount can be both shared as well as slave. This state
297 indicates that the mount is a slave of some vfsmount, and
298 has its own peer group too. This vfsmount receives propagation
299 events from its master vfsmount, and also forwards propagation
300 events to its 'peer group' and to its slave vfsmounts.
302 Strictly speaking, the vfsmount is shared having its own
303 peer group, and this peer-group is a slave of some other
Only a slave vfsmount can be made 'shared and slave', either
by executing the following command::
309 mount --make-shared mount
311 or by moving the slave vfsmount under a shared vfsmount.
A 'private mount' is defined as a vfsmount that does not
316 receive or forward any propagation events.
An 'unbindable mount' is defined as a vfsmount that does not
321 receive or forward any propagation events and cannot
327 The state diagram below explains the state transition of a mount,
328 in response to various commands::
------------------------------------------------------------------------------
|            | make-shared | make-slave     | make-private | make-unbindable |
|------------|-------------|----------------|--------------|-----------------|
| shared     | shared      | *slave/private | private      | unbindable      |
|------------|-------------|----------------|--------------|-----------------|
| slave      | shared      | **slave        | private      | unbindable      |
|            | and slave   |                |              |                 |
|------------|-------------|----------------|--------------|-----------------|
| shared     | shared      | slave          | private      | unbindable      |
| and slave  | and slave   |                |              |                 |
|------------|-------------|----------------|--------------|-----------------|
| private    | shared      | **private      | private      | unbindable      |
|------------|-------------|----------------|--------------|-----------------|
| unbindable | shared      | **unbindable   | private      | unbindable      |
------------------------------------------------------------------------------
* If the shared mount is the only mount in its peer group, making it a
slave makes it private automatically, since there is no master to
which it can be slaved.
351 ** slaving a non-shared mount has no effect on the mount.
Apart from the commands listed above, the 'move' operation also changes
the state of a mount depending on the type of the destination mount. It's
explained in section 5d.
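The state diagram can also be read as a lookup function. The following is a
minimal sketch of that reading, not kernel code: the function name next_state,
the state spelling shared-and-slave and the optional PEERS argument are all
assumptions made for this illustration. The slave row follows the statement in
this section that a slave vfsmount made shared becomes 'shared and slave'.

```shell
# next_state CURRENT COMMAND [PEERS]
#   CURRENT: shared | slave | shared-and-slave | private | unbindable
#   COMMAND: make-shared | make-slave | make-private | make-unbindable
#   PEERS:   number of other mounts in the peer group (default 1); only
#            consulted when a shared mount is made a slave.
next_state() {
    case "$2" in
        make-private)    echo private ;;
        make-unbindable) echo unbindable ;;
        make-shared)
            case "$1" in
                # A slave keeps its master and gains a peer group of its own.
                slave|shared-and-slave) echo shared-and-slave ;;
                *) echo shared ;;
            esac ;;
        make-slave)
            case "$1" in
                shared)
                    # With no peer left there is no master to be slaved to.
                    if [ "${3:-1}" -gt 0 ]; then echo slave; else echo private; fi ;;
                shared-and-slave) echo slave ;;
                # Slaving a non-shared mount has no effect on the mount.
                *) echo "$1" ;;
            esac ;;
        *) echo "unknown command: $2" >&2; return 1 ;;
    esac
}

next_state shared make-slave 0
next_state slave make-shared
```

The two calls at the end exercise the two footnoted cases: a lone shared mount
that is slaved, and a slave that is made shared.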
359 Consider the following command::
363 where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
364 is the destination mount and 'b' is the dentry in the destination mount.
366 The outcome depends on the type of mount of 'A' and 'B'. The table
367 below contains quick reference::
-------------------------------------------------------------------------
|                         BIND MOUNT OPERATION                          |
|***********************************************************************|
| source(A)-> | shared      | private     | slave          | unbindable |
| dest(B)     |             |             |                |            |
|***********************************************************************|
| shared      | shared      | shared      | shared & slave | invalid    |
|-----------------------------------------------------------------------|
| non-shared  | shared      | private     | slave          | invalid    |
*************************************************************************
1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
387 are created and mounted at the dentry 'b' on all mounts where 'B'
388 propagates to. A new propagation tree containing 'C1',..,'Cn' is
389 created. This propagation tree is identical to the propagation tree of
390 'B'. And finally the peer-group of 'C' is merged with the peer group
2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
396 are created and mounted at the dentry 'b' on all mounts where 'B'
397 propagates to. A new propagation tree is set containing all new mounts
398 'C', 'C1', .., 'Cn' with exactly the same configuration as the
399 propagation tree for 'B'.
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
mount 'C' which is a clone of 'A', is created. Its root dentry is 'a'.
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
'C3' ... are created and mounted at the dentry 'b' on all mounts where
'B' propagates to. A new propagation tree containing the new mounts
'C', 'C1', .. 'Cn' is created. This propagation tree is identical to the
propagation tree for 'B'. And finally the mount 'C' and its peer group
are made the slave of mount 'Z'. In other words, mount 'C' is in the
state 'slave and shared'.
411 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
5. 'A' is a private mount and 'B' is a non-shared (private or slave or
unbindable) mount. A new mount 'C' which is a clone of 'A', is created.
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
418 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
419 which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
420 mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
423 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
424 new mount 'C' which is a clone of 'A' is created. Its root dentry is
425 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
426 slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
'Z'. All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
mount/unmount events on 'A' do not propagate anywhere else. Similarly,
mount/unmount events on 'C' do not propagate anywhere else.
8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
invalid operation. An unbindable mount cannot be bind mounted.
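The eight cases above collapse into a small decision function. The following
is a sketch of the quick reference table only, not of the propagation trees
that get built; the name bind_result and the state spelling shared-and-slave
are assumptions made for this illustration, and any destination state other
than "shared" is treated as non-shared.

```shell
# bind_result SOURCE_STATE DEST_STATE
# Prints the state of the new mount 'C' created by binding 'A' onto 'B'.
# Returns 1 when the bind is an invalid operation.
bind_result() {
    case "$1:$2" in
        # An unbindable mount cannot be bind mounted anywhere (cases 4, 8).
        unbindable:*)            echo invalid; return 1 ;;
        # A slave bound onto a shared mount joins a peer group and
        # keeps its master (case 3).
        slave:shared)            echo shared-and-slave ;;
        # Cases 1, 2 and 6 all end with a shared 'C'.
        shared:*|private:shared) echo shared ;;
        # Cases 5 and 7: 'C' keeps the state of 'A'.
        private:*|slave:*)       echo "$1" ;;
        *) echo "unknown source state: $1" >&2; return 2 ;;
    esac
}

for src in shared private slave; do
    echo "$src onto shared:  $(bind_result "$src" shared)"
    echo "$src onto private: $(bind_result "$src" private)"
done
```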
rbind is the same as bind, except that bind replicates only the specified
mount, whereas rbind replicates all the mounts in the tree belonging to the
specified mount. In other words, an rbind mount is a bind mount applied to
all the mounts in the tree.
If the source tree that is rbind mounted has some unbindable mounts,
441 then the subtree under the unbindable mount is pruned in the new
446 let's say we have the following mount tree::
Let's say all the mounts except the mount C in the tree are
of a type other than unbindable.
457 If this tree is rbound to say Z
459 We will have the following tree at the new location::
465 B' Note how the tree under C is pruned
466 / \ in the new location.
473 Consider the following command
477 where 'A' is the source mount, 'B' is the destination mount and 'b' is
478 the dentry in the destination mount.
480 The outcome depends on the type of the mount of 'A' and 'B'. The table
481 below is a quick reference::
---------------------------------------------------------------------------
|                          MOVE MOUNT OPERATION                           |
|*************************************************************************|
| source(A)-> | shared      | private     | slave            | unbindable |
| dest(B)     |             |             |                  |            |
|*************************************************************************|
| shared      | shared      | shared      | shared and slave | invalid    |
|-------------------------------------------------------------------------|
| non-shared  | shared      | private     | slave            | unbindable |
***************************************************************************
496 .. Note:: moving a mount residing under a shared mount is invalid.
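The quick reference table and the note above can be combined into one lookup.
The following is a sketch under stated assumptions: the name move_result, the
state spelling shared-and-slave and the optional third argument describing the
current parent of 'A' are made up for this illustration, and any destination
state other than "shared" is treated as non-shared.

```shell
# move_result SOURCE_STATE DEST_STATE [PARENT_STATE]
# Prints the state of mount 'A' after it is moved onto mount 'B'.
# PARENT_STATE is the state of the mount 'A' currently resides under
# (default: non-shared). Returns 1 when the move is invalid.
move_result() {
    # Moving a mount residing under a shared mount is invalid.
    if [ "${3:-non-shared}" = shared ]; then
        echo invalid
        return 1
    fi
    case "$1:$2" in
        # Propagation would require cloning 'A', which is not possible.
        unbindable:shared)       echo invalid; return 1 ;;
        slave:shared)            echo shared-and-slave ;;
        shared:*|private:shared) echo shared ;;
        # Onto a non-shared destination 'A' keeps its state.
        private:*|slave:*|unbindable:*) echo "$1" ;;
        *) echo "unknown source state: $1" >&2; return 2 ;;
    esac
}

move_result unbindable private
move_result private shared
```

Unlike a bind, moving an unbindable mount onto a non-shared destination is
allowed, because no clone of the mount has to be created.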
500 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
501 mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
502 are created and mounted at dentry 'b' on all mounts that receive
503 propagation from mount 'B'. A new propagation tree is created in the
504 exact same configuration as that of 'B'. This new propagation tree
505 contains all the new mounts 'A1', 'A2'... 'An'. And this new
506 propagation tree is appended to the already existing propagation tree
509 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
511 are created and mounted at dentry 'b' on all mounts that receive
512 propagation from mount 'B'. The mount 'A' becomes a shared mount and a
513 propagation tree is created which is identical to that of
514 'B'. This new propagation tree contains all the new mounts 'A1',
517 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
518 mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
519 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
520 receive propagation from mount 'B'. A new propagation tree is created
521 in the exact same configuration as that of 'B'. This new propagation
522 tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
523 propagation tree is appended to the already existing propagation tree of
524 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
is invalid, because mounting anything on the shared mount 'B' can
create new mounts that get mounted on the mounts that receive
propagation from 'B'. Since the mount 'A' is unbindable, it cannot be
cloned for mounting at those other mountpoints.
5. 'A' is a private mount and 'B' is a non-shared (private or slave or
534 unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
536 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
537 is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
540 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
541 The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
542 continues to be a slave mount of mount 'Z'.
8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
545 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
550 Consider the following command::
554 'B' is the destination mount and 'b' is the dentry in the destination
557 The above operation is the same as bind operation with the exception
558 that the source mount is always a private mount.
561 5f) Unmount semantics
563 Consider the following command::
567 where 'A' is a mount mounted on mount 'B' at dentry 'b'.
If mount 'B' is shared, then all most-recently-mounted mounts at dentry
'b' on mounts that receive propagation from mount 'B' and do not have
sub-mounts within them are unmounted.
573 Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
576 let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
577 'B1', 'B2' and 'B3' respectively.
579 let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
580 mount 'B1', 'B2' and 'B3' respectively.
582 if 'C1' is unmounted, all the mounts that are most-recently-mounted on
583 'B1' and on the mounts that 'B1' propagates-to are unmounted.
585 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
586 on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
588 So all 'C1', 'C2' and 'C3' should be unmounted.
590 If any of 'C2' or 'C3' has some child mounts, then that mount is not
591 unmounted, but all other mounts are unmounted. However if 'C1' is told
592 to be unmounted and 'C1' has some sub-mounts, the umount operation is
A cloned namespace contains all the same mounts as the parent
600 Let's say 'A' and 'B' are the corresponding mounts in the parent and the
603 If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
606 If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
609 If 'A' is a private mount, then 'B' is a private mount too.
If 'A' is an unbindable mount, then 'B' is an unbindable mount too.
616 A. What is the result of the following command sequence?
620 mount --bind /mnt /mnt
621 mount --make-shared /mnt
622 mount --bind /mnt /tmp
623 mount --move /tmp /mnt/1
What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
Should they all be identical, or should /mnt and /mnt/1 be
630 B. What is the result of the following command sequence?
634 mount --make-rshared /
What should the content of /v/1/v/1 be?
641 C. What is the result of the following command sequence?
645 mount --bind /mnt /mnt
646 mount --make-shared /mnt
647 mkdir -p /mnt/1/2/3 /mnt/1/test
648 mount --bind /mnt/1 /tmp
649 mount --make-slave /mnt
650 mount --make-shared /mnt
651 mount --bind /mnt/1/2 /tmp1
652 mount --make-slave /mnt
654 At this point we have the first mount at /tmp and
655 its root dentry is 1. Let's call this mount 'A'
656 And then we have a second mount at /tmp1 with root
657 dentry 2. Let's call this mount 'B'
658 Next we have a third mount at /mnt with root dentry
659 mnt. Let's call this mount 'C'
661 'B' is the slave of 'A' and 'C' is a slave of 'B'
664 at this point if we execute the following command
666 mount --bind /bin /tmp/test
668 The mount is attempted on 'A'
670 will the mount propagate to 'B' and 'C' ?
672 what would be the contents of
677 Q1. Why is bind mount needed? How is it different from symbolic links?
678 symbolic links can get stale if the destination mount gets
679 unmounted or moved. Bind mounts continue to exist even if the
680 other mount is unmounted or moved.
682 Q2. Why can't the shared subtree be implemented using exportfs?
684 exportfs is a heavyweight way of accomplishing part of what
685 shared subtree can do. I cannot imagine a way to implement the
semantics of slave mount using exportfs.
Q3. Why is unbindable mount needed?
690 Let's say we want to replicate the mount tree at multiple
691 locations within the same subtree.
If one rbind mounts a tree within the same subtree 'n' times,
the number of mounts created grows at least exponentially with 'n'.
695 Having unbindable mount can help prune the unneeded bind
696 mounts. Here is an example.
699 let's say the root tree has just two directories with
706 And we want to replicate the tree at multiple
707 mountpoints under /root/tmp
713 mount --make-shared /root
717 mount --rbind /root /tmp/m1
719 the new tree now looks like this::
737 mount --rbind /root /tmp/m2
739 the new tree now looks like this::
764 mount --rbind /root /tmp/m3
766 I won't draw the tree..but it has 24 vfsmounts
At step i the number of vfsmounts is V[i] = i*V[i-1].
This grows factorially, that is, even faster than an exponential
function. And this tree has way more mounts than what we really
needed in the first place.
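To get a feel for how quickly this recurrence grows, it can be evaluated
directly. The following is a small sketch, assuming V[1] = 1 (the single
vfsmount of the starting tree), which is consistent with the 24 vfsmounts
quoted for step 4; the function name vfsmount_count is made up for this
illustration.

```shell
# vfsmount_count STEP
# Evaluates V[i] = i * V[i-1] with V[1] = 1 and prints V[STEP].
vfsmount_count() {
    v=1
    i=2
    while [ "$i" -le "$1" ]; do
        v=$((v * i))
        i=$((i + 1))
    done
    echo "$v"
}

for step in 1 2 3 4 5 6 7 8; do
    echo "step $step: $(vfsmount_count "$step") vfsmounts"
done
```

The count is multiplied by the step number at every step, so only a handful
of additional rbind mounts is enough to reach tens of thousands of vfsmounts.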
773 One could use a series of umount at each step to prune
774 out the unneeded mounts. But there is a better solution.
Unbindable mounts come in handy here.
778 let's say the root tree has just two directories with
785 How do we set up the same tree at multiple locations under
792 mount --bind /root/tmp /root/tmp
794 mount --make-rshared /root
795 mount --make-unbindable /root/tmp
799 mount --rbind /root /tmp/m1
801 the new tree now looks like this::
815 mount --rbind /root /tmp/m2
817 the new tree now looks like this::
831 mount --rbind /root /tmp/m3
833 the new tree now looks like this::
841 tmp usr tmp usr tmp usr
847 4 new fields are introduced to struct vfsmount:
links together all the mounts to/from which this vfsmount
sends/receives propagation events.
859 links all the mounts to which this vfsmount propagates
863 links together all the slaves that its master vfsmount
867 points to the master vfsmount from which this vfsmount
868 receives propagation.
871 takes two more flags to indicate the propagation status of
the vfsmount. MNT_SHARED indicates that the vfsmount is a shared
vfsmount. MNT_UNBINDABLE indicates that the vfsmount cannot be
876 All the shared vfsmounts in a peer group form a cyclic list through
All vfsmounts with the same ->mnt_master form a cyclic list anchored
880 in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
882 ->mnt_master can point to arbitrary (and possibly different) members
883 of master peer group. To find all immediate slaves of a peer group
884 you need to go through _all_ ->mnt_slave_list of its members.
885 Conceptually it's just a single set - distribution among the
886 individual lists does not affect propagation or the way propagation
887 tree is modified by operations.
889 All vfsmounts in a peer group have the same ->mnt_master. If it is
890 non-NULL, they form a contiguous (ordered) segment of slave list.
An example propagation tree looks as shown in the figure below.
893 [ NOTE: Though it looks like a forest, if we consider all the shared
894 mounts as a conceptual entity called 'pnode', it becomes a tree]::
897 A <--> B <--> C <---> D
In the above figure A, B, C and D are all shared and propagate to each
other. 'A' has got 3 slave mounts 'E', 'F' and 'G'; 'C' has got 2 slave
mounts 'J' and 'K'; and 'D' has got two slave mounts 'H' and 'I'.
'E' is also shared with 'K' and they propagate to each other. And
'K' has 3 slaves 'M', 'L' and 'N'.
911 A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
913 A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
915 E's ->mnt_share links with ->mnt_share of K
917 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
919 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
921 K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
923 C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
925 J and K's ->mnt_master points to struct vfsmount of C
927 and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
929 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
932 NOTE: The propagation tree is orthogonal to the mount tree.
936 ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
937 by namespace_sem (exclusive for modifications, shared for reading).
939 Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
940 There are two exceptions: do_add_mount() and clone_mnt().
941 The former modifies a vfsmount that has not been visible in any shared
943 The latter holds namespace_sem and the only references to vfsmount
944 are in lists that can't be traversed without namespace_sem.
948 The crux of the implementation resides in rbind/move operation.
950 The overall algorithm breaks the operation into 3 phases: (look at
951 attach_recursive_mnt() and propagate_mnt())
959 for each mount in the source tree:
961 a) Create the necessary number of mount trees to
962 be attached to each of the mounts that receive
963 propagation from the destination mount.
964 b) Do not attach any of the trees to its destination.
965 However note down its ->mnt_parent and ->mnt_mountpoint
966 c) Link all the new mounts to form a propagation tree that
967 is identical to the propagation tree of the destination
970 If this phase is successful, there should be 'n' new
971 propagation trees; where 'n' is the number of mounts in the
972 source tree. Go to the commit phase
974 Also there should be 'm' new mount trees, where 'm' is
975 the number of mounts to which the destination mount
978 if any memory allocations fail, go to the abort phase.
981 attach each of the mount trees to their corresponding
985 delete all the newly created trees.
All the propagation-related functionality resides in the file pnode.c
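The prepare, commit and abort phases follow a general all-or-nothing pattern.
The following is a user-space analogy only, using plain directories instead
of mounts; it is not how attach_recursive_mnt() or propagate_mnt() are
written. The function name attach_all and the ".staged" naming are
hypothetical. Unlike the kernel's commit phase, the mv in this sketch could in
principle fail, which is not handled here.

```shell
# attach_all SRC DEST...
# Copies the directory SRC into every DEST as DEST/<basename of SRC>,
# either into all of them or into none of them.
attach_all() {
    src=$1
    shift
    name=$(basename "$src")
    # Prepare phase: create one copy per destination, but do not attach
    # it under its final name yet.
    for dest in "$@"; do
        if ! cp -R "$src" "$dest/.staged.$$" 2>/dev/null; then
            # Abort phase: delete all the newly created copies.
            for d in "$@"; do
                rm -rf "$d/.staged.$$"
            done
            return 1
        fi
    done
    # Commit phase: attach each prepared copy to its destination.
    for dest in "$@"; do
        mv "$dest/.staged.$$" "$dest/$name"
    done
}
```

The point of the split is that a failure can only happen while nothing is
visible yet, so an aborted operation leaves every destination untouched.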
991 ------------------------------------------------------------------------
993 version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
995 version 0.2 (Incorporated comments from Al Viro)