docs: filesystems: convert orangefs.txt to ReST

author Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Mon, 17 Feb 2020 16:12:17 +0000 (17:12 +0100)

committer Jonathan Corbet <corbet@lwn.net>

Mon, 2 Mar 2020 21:04:40 +0000 (14:04 -0700)
author Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Mon, 17 Feb 2020 16:12:17 +0000 (17:12 +0100)
committer Jonathan Corbet <corbet@lwn.net>
Mon, 2 Mar 2020 21:04:40 +0000 (14:04 -0700)
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst

index fbee7717584083c40e295d164a3e52194a7b18c3..fed53f83119284b490c1e85a45142835802c27b6 100644 (file)
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -79,6 +79,7 @@ Documentation for filesystem implementations.
     ocfs2
     ocfs2-online-filecheck
     omfs
+   orangefs
     overlayfs
     virtiofs
     vfat
diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst

new file mode 100644 (file)

index 0000000..7d6d4ca
--- /dev/null
+++ b/Documentation/filesystems/orangefs.rst
@@ -0,0 +1,554 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+ORANGEFS
+========
+
+OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
+for large storage problems faced by HPC, BigData, Streaming Video,
+Genomics, Bioinformatics.
+
+Orangefs, originally called PVFS, was first developed in 1993 by
+Walt Ligon and Eric Blumer as a parallel file system for Parallel
+Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
+of parallel programs.
+
+Orangefs features include:
+
+  * Distributes file data among multiple file servers
+  * Supports simultaneous access by multiple clients
+  * Stores file data and metadata on servers using local file system
+    and access methods
+  * Userspace implementation is easy to install and maintain
+  * Direct MPI support
+  * Stateless
+
+
+Mailing List Archives
+=====================
+
+http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
+
+
+Mailing List Submissions
+========================
+
+devel@lists.orangefs.org
+
+
+Documentation
+=============
+
+http://www.orangefs.org/documentation/
+
+
+Userspace Filesystem Source
+===========================
+
+http://www.orangefs.org/download
+
+Orangefs versions prior to 2.9.3 would not be compatible with the
+upstream version of the kernel client.
+
+
+Running ORANGEFS On a Single Server
+===================================
+
+OrangeFS is usually run in large installations with multiple servers and
+clients, but a complete filesystem can be run on a single machine for
+development and testing.
+
+On Fedora, install orangefs and orangefs-server::
+
+    dnf -y install orangefs orangefs-server
+
+There is an example server configuration file in
+/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
+necessary.
+
+To generate a filesystem to run xfstests against, see below.
+
+There is an example client configuration file in /etc/pvfs2tab.  It is a
+single line.  Uncomment it and change the hostname if necessary.  This
+controls clients which use libpvfs2.  This does not control the
+pvfs2-client-core.
+
+Create the filesystem::
+
+    pvfs2-server -f /etc/orangefs/orangefs.conf
+
+Start the server::
+
+    systemctl start orangefs-server
+
+Test the server::
+
+    pvfs2-ping -m /pvfsmnt
+
+Start the client.  The module must be compiled in or loaded before this
+point::
+
+    systemctl start orangefs-client
+
+Mount the filesystem::
+
+    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+
+Building ORANGEFS on a Single Server
+====================================
+
+Where OrangeFS cannot be installed from distribution packages, it may be
+built from source.
+
+You can omit --prefix if you don't care that things are sprinkled around
+in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
+default, we will probably be changing the default to LMDB soon.
+
+::
+
+    ./configure --prefix=/opt/ofs --with-db-backend=lmdb
+
+    make
+
+    make install
+
+Create an orangefs config file::
+
+    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
+
+Create an /etc/pvfs2tab file::
+
+    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
+       /etc/pvfs2tab
+
+Create the mount point you specified in the tab file if needed::
+
+    mkdir /pvfsmnt
+
+Bootstrap the server::
+
+    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
+
+Start the server::
+
+    /opt/osf/sbin/pvfs2-server /etc/pvfs2.conf
+
+Now the server should be running. Pvfs2-ls is a simple
+test to verify that the server is running::
+
+    /opt/ofs/bin/pvfs2-ls /pvfsmnt
+
+If stuff seems to be working, load the kernel module and
+turn on the client core::
+
+    /opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core
+
+Mount your filesystem::
+
+    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+
+Running xfstests
+================
+
+It is useful to use a scratch filesystem with xfstests.  This can be
+done with only one server.
+
+Make a second copy of the FileSystem section in the server configuration
+file, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
+Change the ID to something other than the ID of the first FileSystem
+section (2 is usually a good choice).
+
+Then there are two FileSystem sections: orangefs and scratch.
+
+This change should be made before creating the filesystem.
+
+::
+
+    pvfs2-server -f /etc/orangefs/orangefs.conf
+
+To run xfstests, create /etc/xfsqa.config::
+
+    TEST_DIR=/orangefs
+    TEST_DEV=tcp://localhost:3334/orangefs
+    SCRATCH_MNT=/scratch
+    SCRATCH_DEV=tcp://localhost:3334/scratch
+
+Then xfstests can be run::
+
+    ./check -pvfs2
+
+
+Options
+=======
+
+The following mount options are accepted:
+
+  acl
+    Allow the use of Access Control Lists on files and directories.
+
+  intr
+    Some operations between the kernel client and the user space
+    filesystem can be interruptible, such as changes in debug levels
+    and the setting of tunable parameters.
+
+  local_lock
+    Enable posix locking from the perspective of "this" kernel. The
+    default file_operations lock action is to return ENOSYS. Posix
+    locking kicks in if the filesystem is mounted with -o local_lock.
+    Distributed locking is being worked on for the future.
+
+
+Debugging
+=========
+
+If you want the debug (GOSSIP) statements in a particular
+source file (inode.c for example) go to syslog::
+
+  echo inode > /sys/kernel/debug/orangefs/kernel-debug
+
+No debugging (the default)::
+
+  echo none > /sys/kernel/debug/orangefs/kernel-debug
+
+Debugging from several source files::
+
+  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
+
+All debugging::
+
+  echo all > /sys/kernel/debug/orangefs/kernel-debug
+
+Get a list of all debugging keywords::
+
+  cat /sys/kernel/debug/orangefs/debug-help
+
+
+Protocol between Kernel Module and Userspace
+============================================
+
+Orangefs is a user space filesystem and an associated kernel module.
+We'll just refer to the user space part of Orangefs as "userspace"
+from here on out. Orangefs descends from PVFS, and userspace code
+still uses PVFS for function and variable names. Userspace typedefs
+many of the important structures. Function and variable names in
+the kernel module have been transitioned to "orangefs", and The Linux
+Coding Style avoids typedefs, so kernel module structures that
+correspond to userspace structures are not typedefed.
+
+The kernel module implements a pseudo device that userspace
+can read from and write to. Userspace can also manipulate the
+kernel module through the pseudo device with ioctl.
+
+The Bufmap
+----------
+
+At startup userspace allocates two page-size-aligned (posix_memalign)
+mlocked memory buffers, one is used for IO and one is used for readdir
+operations. The IO buffer is 41943040 bytes and the readdir buffer is
+4194304 bytes. Each buffer contains logical chunks, or partitions, and
+a pointer to each buffer is added to its own PVFS_dev_map_desc structure
+which also describes its total size, as well as the size and number of
+the partitions.
+
+A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
+mapping routine in the kernel module with an ioctl. The structure is
+copied from user space to kernel space with copy_from_user and is used
+to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
+then contains:
+
+  * refcnt
+    - a reference counter
+  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
+    partition size, which represents the filesystem's block size and
+    is used for s_blocksize in super blocks.
+  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
+    partitions in the IO buffer.
+  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
+  * total_size - the total size of the IO buffer.
+  * page_count - the number of 4096 byte pages in the IO buffer.
+  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
+    of kcalloced memory. This memory is used as an array of pointers
+    to each of the pages in the IO buffer through a call to get_user_pages.
+  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
+    bytes of kcalloced memory. This memory is further intialized:
+
+      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
+      structure. user_desc->ptr points to the IO buffer.
+
+      ::
+
+       pages_per_desc = bufmap->desc_size / PAGE_SIZE
+       offset = 0
+
+        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
+        bufmap->desc_array[0].array_count = pages_per_desc = 1024
+        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
+        offset += 1024
+                           .
+                           .
+                           .
+        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
+        bufmap->desc_array[9].array_count = pages_per_desc = 1024
+        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
+                                               (9 * 1024 * 4096)
+        offset += 1024
+
+  * buffer_index_array - a desc_count sized array of ints, used to
+    indicate which of the IO buffer's partitions are available to use.
+  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
+  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
+    int array used to indicate which of the readdir buffer's partitions are
+    available to use.
+  * readdir_index_lock - a spinlock to protect readdir_index_array during
+    update.
+
+Operations
+----------
+
+The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
+needs to communicate with userspace. Part of the op contains the "upcall"
+which expresses the request to userspace. Part of the op eventually
+contains the "downcall" which expresses the results of the request.
+
+The slab allocator is used to keep a cache of op structures handy.
+
+At init time the kernel module defines and initializes a request list
+and an in_progress hash table to keep track of all the ops that are
+in flight at any given time.
+
+Ops are stateful:
+
+ * unknown
+           - op was just initialized
+ * waiting
+           - op is on request_list (upward bound)
+ * inprogr
+           - op is in progress (waiting for downcall)
+ * serviced
+           - op has matching downcall; ok
+ * purged
+           - op has to start a timer since client-core
+              exited uncleanly before servicing op
+ * given up
+           - submitter has given up waiting for it
+
+When some arbitrary userspace program needs to perform a
+filesystem operation on Orangefs (readdir, I/O, create, whatever)
+an op structure is initialized and tagged with a distinguishing ID
+number. The upcall part of the op is filled out, and the op is
+passed to the "service_operation" function.
+
+Service_operation changes the op's state to "waiting", puts
+it on the request list, and signals the Orangefs file_operations.poll
+function through a wait queue. Userspace is polling the pseudo-device
+and thus becomes aware of the upcall request that needs to be read.
+
+When the Orangefs file_operations.read function is triggered, the
+request list is searched for an op that seems ready-to-process.
+The op is removed from the request list. The tag from the op and
+the filled-out upcall struct are copy_to_user'ed back to userspace.
+
+If any of these (and some additional protocol) copy_to_users fail,
+the op's state is set to "waiting" and the op is added back to
+the request list. Otherwise, the op's state is changed to "in progress",
+and the op is hashed on its tag and put onto the end of a list in the
+in_progress hash table at the index the tag hashed to.
+
+When userspace has assembled the response to the upcall, it
+writes the response, which includes the distinguishing tag, back to
+the pseudo device in a series of io_vecs. This triggers the Orangefs
+file_operations.write_iter function to find the op with the associated
+tag and remove it from the in_progress hash table. As long as the op's
+state is not "canceled" or "given up", its state is set to "serviced".
+The file_operations.write_iter function returns to the waiting vfs,
+and back to service_operation through wait_for_matching_downcall.
+
+Service operation returns to its caller with the op's downcall
+part (the response to the upcall) filled out.
+
+The "client-core" is the bridge between the kernel module and
+userspace. The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can timeout
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
+
+  - readdir ops use the smaller of the two pre-allocated pre-partitioned
+    memory buffers. The readdir buffer is only available to userspace.
+    The kernel module obtains an index to a free partition before launching
+    a readdir op. Userspace deposits the results into the indexed partition
+    and then writes them to back to the pvfs device.
+
+  - io (read and write) ops use the larger of the two pre-allocated
+    pre-partitioned memory buffers. The IO buffer is accessible from
+    both userspace and the kernel module. The kernel module obtains an
+    index to a free partition before launching an io op. The kernel module
+    deposits write data into the indexed partition, to be consumed
+    directly by userspace. Userspace deposits the results of read
+    requests into the indexed partition, to be consumed directly
+    by the kernel module.
+
+Responses to kernel requests are all packaged in pvfs2_downcall_t
+structs. Besides a few other members, pvfs2_downcall_t contains a
+union of structs, each of which is associated with a particular
+response type.
+
+The several members outside of the union are:
+
+ ``int32_t type``
+    - type of operation.
+ ``int32_t status``
+    - return code for the operation.
+ ``int64_t trailer_size``
+    - 0 unless readdir operation.
+ ``char *trailer_buf``
+    - initialized to NULL, used during readdir operations.
+
+The appropriate member inside the union is filled out for any
+particular response.
+
+  PVFS2_VFS_OP_FILE_IO
+    fill a pvfs2_io_response_t
+
+  PVFS2_VFS_OP_LOOKUP
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_CREATE
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_SYMLINK
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_GETATTR
+    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
+    fill in a string with the link target when the object is a symlink.
+
+  PVFS2_VFS_OP_MKDIR
+    fill a PVFS_object_kref
+
+  PVFS2_VFS_OP_STATFS
+    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
+    us to know, in a timely fashion, these statistics about our
+    distributed network filesystem.
+
+  PVFS2_VFS_OP_FS_MOUNT
+    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
+    except its members are in a different order and "__pad1" is replaced
+    with "id".
+
+  PVFS2_VFS_OP_GETXATTR
+    fill a pvfs2_getxattr_response_t
+
+  PVFS2_VFS_OP_LISTXATTR
+    fill a pvfs2_listxattr_response_t
+
+  PVFS2_VFS_OP_PARAM
+    fill a pvfs2_param_response_t
+
+  PVFS2_VFS_OP_PERF_COUNT
+    fill a pvfs2_perf_count_response_t
+
+  PVFS2_VFS_OP_FSKEY
+    file a pvfs2_fs_key_response_t
+
+  PVFS2_VFS_OP_READDIR
+    jamb everything needed to represent a pvfs2_readdir_response_t into
+    the readdir buffer descriptor specified in the upcall.
+
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
+made by the kernel side.
+
+A buffer_list containing:
+
+  - a pointer to the prepared response to the request from the
+    kernel (struct pvfs2_downcall_t).
+  - and also, in the case of a readdir request, a pointer to a
+    buffer containing descriptors for the objects in the target
+    directory.
+
+... is sent to the function (PINT_dev_write_list) which performs
+the writev.
+
+PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
+
+The first four elements of io_array are initialized like this for all
+responses::
+
+  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
+  io_array[0].iov_len = sizeof(int32_t)
+
+  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
+  io_array[1].iov_len = sizeof(int32_t)
+
+  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
+  io_array[2].iov_len = sizeof(int64_t)
+
+  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
+                         of global variable vfs_request (vfs_request_t)
+  io_array[3].iov_len = sizeof(pvfs2_downcall_t)
+
+Readdir responses initialize the fifth element io_array like this::
+
+  io_array[4].iov_base = contents of member trailer_buf (char *)
+                         from out_downcall member of global variable
+                         vfs_request
+  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
+                        from out_downcall member of global variable
+                        vfs_request
+
+Orangefs exploits the dcache in order to avoid sending redundant
+requests to userspace. We keep object inode attributes up-to-date with
+orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
+help it decide whether or not to update an inode: "new" and "bypass".
+Orangefs keeps private data in an object's inode that includes a short
+timeout value, getattr_time, which allows any iteration of
+orangefs_inode_getattr to know how long it has been since the inode was
+updated. When the object is not new (new == 0) and the bypass flag is not
+set (bypass == 0) orangefs_inode_getattr returns without updating the inode
+if getattr_time has not timed out. Getattr_time is updated each time the
+inode is updated.
+
+Creation of a new object (file, dir, sym-link) includes the evaluation of
+its pathname, resulting in a negative directory entry for the object.
+A new inode is allocated and associated with the dentry, turning it from
+a negative dentry into a "productive full member of society". Orangefs
+obtains the new inode from Linux with new_inode() and associates
+the inode with the dentry by sending the pair back to Linux with
+d_instantiate().
+
+The evaluation of a pathname for an object resolves to its corresponding
+dentry. If there is no corresponding dentry, one is created for it in
+the dcache. Whenever a dentry is modified or verified Orangefs stores a
+short timeout value in the dentry's d_time, and the dentry will be trusted
+for that amount of time. Orangefs is a network filesystem, and objects
+can potentially change out-of-band with any particular Orangefs kernel module
+instance, so trusting a dentry is risky. The alternative to trusting
+dentries is to always obtain the needed information from userspace - at
+least a trip to the client-core, maybe to the servers. Obtaining information
+from a dentry is cheap, obtaining it from userspace is relatively expensive,
+hence the motivation to use the dentry when possible.
+
+The timeout values d_time and getattr_time are jiffy based, and the
+code is designed to avoid the jiffy-wrap problem::
+
+    "In general, if the clock may have wrapped around more than once, there
+    is no way to tell how much time has elapsed. However, if the times t1
+    and t2 are known to be fairly close, we can reliably compute the
+    difference in a way that takes into account the possibility that the
+    clock may have wrapped between times."
+
+from course notes by instructor Andy Wang
+
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt

deleted file mode 100644 (file)

index f4ba949..0000000
--- a/Documentation/filesystems/orangefs.txt
+++ /dev/null
@@ -1,529 +0,0 @@
-ORANGEFS
-========
-
-OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
-for large storage problems faced by HPC, BigData, Streaming Video,
-Genomics, Bioinformatics.
-
-Orangefs, originally called PVFS, was first developed in 1993 by
-Walt Ligon and Eric Blumer as a parallel file system for Parallel
-Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
-of parallel programs.
-
-Orangefs features include:
-
-  * Distributes file data among multiple file servers
-  * Supports simultaneous access by multiple clients
-  * Stores file data and metadata on servers using local file system
-    and access methods
-  * Userspace implementation is easy to install and maintain
-  * Direct MPI support
-  * Stateless
-
-
-MAILING LIST ARCHIVES
-=====================
-
-http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
-
-
-MAILING LIST SUBMISSIONS
-========================
-
-devel@lists.orangefs.org
-
-
-DOCUMENTATION
-=============
-
-http://www.orangefs.org/documentation/
-
-
-USERSPACE FILESYSTEM SOURCE
-===========================
-
-http://www.orangefs.org/download
-
-Orangefs versions prior to 2.9.3 would not be compatible with the
-upstream version of the kernel client.
-
-
-RUNNING ORANGEFS ON A SINGLE SERVER
-===================================
-
-OrangeFS is usually run in large installations with multiple servers and
-clients, but a complete filesystem can be run on a single machine for
-development and testing.
-
-On Fedora, install orangefs and orangefs-server.
-
-dnf -y install orangefs orangefs-server
-
-There is an example server configuration file in
-/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
-necessary.
-
-To generate a filesystem to run xfstests against, see below.
-
-There is an example client configuration file in /etc/pvfs2tab.  It is a
-single line.  Uncomment it and change the hostname if necessary.  This
-controls clients which use libpvfs2.  This does not control the
-pvfs2-client-core.
-
-Create the filesystem.
-
-pvfs2-server -f /etc/orangefs/orangefs.conf
-
-Start the server.
-
-systemctl start orangefs-server
-
-Test the server.
-
-pvfs2-ping -m /pvfsmnt
-
-Start the client.  The module must be compiled in or loaded before this
-point.
-
-systemctl start orangefs-client
-
-Mount the filesystem.
-
-mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
-
-
-BUILDING ORANGEFS ON A SINGLE SERVER
-====================================
-
-Where OrangeFS cannot be installed from distribution packages, it may be
-built from source.
-
-You can omit --prefix if you don't care that things are sprinkled around
-in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
-default, we will probably be changing the default to LMDB soon.
-
-./configure --prefix=/opt/ofs --with-db-backend=lmdb
-
-make
-
-make install
-
-Create an orangefs config file.
-
-/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
-
-Create an /etc/pvfs2tab file.
-
-echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
-    /etc/pvfs2tab
-
-Create the mount point you specified in the tab file if needed.
-
-mkdir /pvfsmnt
-
-Bootstrap the server.
-
-/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
-
-Start the server.
-
-/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf
-
-Now the server should be running. Pvfs2-ls is a simple
-test to verify that the server is running.
-
-/opt/ofs/bin/pvfs2-ls /pvfsmnt
-
-If stuff seems to be working, load the kernel module and
-turn on the client core.
-
-/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core
-
-Mount your filesystem.
-
-mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
-
-
-RUNNING XFSTESTS
-================
-
-It is useful to use a scratch filesystem with xfstests.  This can be
-done with only one server.
-
-Make a second copy of the FileSystem section in the server configuration
-file, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
-Change the ID to something other than the ID of the first FileSystem
-section (2 is usually a good choice).
-
-Then there are two FileSystem sections: orangefs and scratch.
-
-This change should be made before creating the filesystem.
-
-pvfs2-server -f /etc/orangefs/orangefs.conf
-
-To run xfstests, create /etc/xfsqa.config.
-
-TEST_DIR=/orangefs
-TEST_DEV=tcp://localhost:3334/orangefs
-SCRATCH_MNT=/scratch
-SCRATCH_DEV=tcp://localhost:3334/scratch
-
-Then xfstests can be run
-
-./check -pvfs2
-
-
-OPTIONS
-=======
-
-The following mount options are accepted:
-
-  acl
-    Allow the use of Access Control Lists on files and directories.
-
-  intr
-    Some operations between the kernel client and the user space
-    filesystem can be interruptible, such as changes in debug levels
-    and the setting of tunable parameters.
-
-  local_lock
-    Enable posix locking from the perspective of "this" kernel. The
-    default file_operations lock action is to return ENOSYS. Posix
-    locking kicks in if the filesystem is mounted with -o local_lock.
-    Distributed locking is being worked on for the future.
-
-
-DEBUGGING
-=========
-
-If you want the debug (GOSSIP) statements in a particular
-source file (inode.c for example) go to syslog:
-
-  echo inode > /sys/kernel/debug/orangefs/kernel-debug
-
-No debugging (the default):
-
-  echo none > /sys/kernel/debug/orangefs/kernel-debug
-
-Debugging from several source files:
-
-  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
-
-All debugging:
-
-  echo all > /sys/kernel/debug/orangefs/kernel-debug
-
-Get a list of all debugging keywords:
-
-  cat /sys/kernel/debug/orangefs/debug-help
-
-
-PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE
-============================================
-
-Orangefs is a user space filesystem and an associated kernel module.
-We'll just refer to the user space part of Orangefs as "userspace"
-from here on out. Orangefs descends from PVFS, and userspace code
-still uses PVFS for function and variable names. Userspace typedefs
-many of the important structures. Function and variable names in
-the kernel module have been transitioned to "orangefs", and The Linux
-Coding Style avoids typedefs, so kernel module structures that
-correspond to userspace structures are not typedefed.
-
-The kernel module implements a pseudo device that userspace
-can read from and write to. Userspace can also manipulate the
-kernel module through the pseudo device with ioctl.
-
-THE BUFMAP:
-
-At startup userspace allocates two page-size-aligned (posix_memalign)
-mlocked memory buffers, one is used for IO and one is used for readdir
-operations. The IO buffer is 41943040 bytes and the readdir buffer is
-4194304 bytes. Each buffer contains logical chunks, or partitions, and
-a pointer to each buffer is added to its own PVFS_dev_map_desc structure
-which also describes its total size, as well as the size and number of
-the partitions.
-
-A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
-mapping routine in the kernel module with an ioctl. The structure is
-copied from user space to kernel space with copy_from_user and is used
-to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
-then contains:
-
-  * refcnt - a reference counter
-  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
-    partition size, which represents the filesystem's block size and
-    is used for s_blocksize in super blocks.
-  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
-    partitions in the IO buffer.
-  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
-  * total_size - the total size of the IO buffer.
-  * page_count - the number of 4096 byte pages in the IO buffer.
-  * page_array - a pointer to page_count * (sizeof(struct page*)) bytes
-    of kcalloced memory. This memory is used as an array of pointers
-    to each of the pages in the IO buffer through a call to get_user_pages.
-  * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc))
-    bytes of kcalloced memory. This memory is further intialized:
-
-      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
-      structure. user_desc->ptr points to the IO buffer.
-
-      pages_per_desc = bufmap->desc_size / PAGE_SIZE
-      offset = 0
-
-        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
-        bufmap->desc_array[0].array_count = pages_per_desc = 1024
-        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
-        offset += 1024
-                           .
-                           .
-                           .
-        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
-        bufmap->desc_array[9].array_count = pages_per_desc = 1024
-        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
-                                               (9 * 1024 * 4096)
-        offset += 1024
-
-  * buffer_index_array - a desc_count sized array of ints, used to
-    indicate which of the IO buffer's partitions are available to use.
-  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
-  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
-    int array used to indicate which of the readdir buffer's partitions are
-    available to use.
-  * readdir_index_lock - a spinlock to protect readdir_index_array during
-    update.
-
-OPERATIONS:
-
-The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
-needs to communicate with userspace. Part of the op contains the "upcall"
-which expresses the request to userspace. Part of the op eventually
-contains the "downcall" which expresses the results of the request.
-
-The slab allocator is used to keep a cache of op structures handy.
-
-At init time the kernel module defines and initializes a request list
-and an in_progress hash table to keep track of all the ops that are
-in flight at any given time.
-
-Ops are stateful:
-
- * unknown  - op was just initialized
- * waiting  - op is on request_list (upward bound)
- * inprogr  - op is in progress (waiting for downcall)
- * serviced - op has matching downcall; ok
- * purged   - op has to start a timer since client-core
-              exited uncleanly before servicing op
- * given up - submitter has given up waiting for it
-
-When some arbitrary userspace program needs to perform a
-filesystem operation on Orangefs (readdir, I/O, create, whatever)
-an op structure is initialized and tagged with a distinguishing ID
-number. The upcall part of the op is filled out, and the op is
-passed to the "service_operation" function.
-
-Service_operation changes the op's state to "waiting", puts
-it on the request list, and signals the Orangefs file_operations.poll
-function through a wait queue. Userspace is polling the pseudo-device
-and thus becomes aware of the upcall request that needs to be read.
-
-When the Orangefs file_operations.read function is triggered, the
-request list is searched for an op that seems ready-to-process.
-The op is removed from the request list. The tag from the op and
-the filled-out upcall struct are copy_to_user'ed back to userspace.
-
-If any of these (and some additional protocol) copy_to_users fail,
-the op's state is set to "waiting" and the op is added back to
-the request list. Otherwise, the op's state is changed to "in progress",
-and the op is hashed on its tag and put onto the end of a list in the
-in_progress hash table at the index the tag hashed to.
-
-When userspace has assembled the response to the upcall, it
-writes the response, which includes the distinguishing tag, back to
-the pseudo device in a series of io_vecs. This triggers the Orangefs
-file_operations.write_iter function to find the op with the associated
-tag and remove it from the in_progress hash table. As long as the op's
-state is not "canceled" or "given up", its state is set to "serviced".
-The file_operations.write_iter function returns to the waiting vfs,
-and back to service_operation through wait_for_matching_downcall.
-
-Service operation returns to its caller with the op's downcall
-part (the response to the upcall) filled out.
-
-The "client-core" is the bridge between the kernel module and
-userspace. The client-core is a daemon. The client-core has an
-associated watchdog daemon. If the client-core is ever signaled
-to die, the watchdog daemon restarts the client-core. Even though
-the client-core is restarted "right away", there is a period of
-time during such an event that the client-core is dead. A dead client-core
-can't be triggered by the Orangefs file_operations.poll function.
-Ops that pass through service_operation during a "dead spell" can timeout
-on the wait queue and one attempt is made to recycle them. Obviously,
-if the client-core stays dead too long, the arbitrary userspace processes
-trying to use Orangefs will be negatively affected. Waiting ops
-that can't be serviced will be removed from the request list and
-have their states set to "given up". In-progress ops that can't
-be serviced will be removed from the in_progress hash table and
-have their states set to "given up".
-
-Readdir and I/O ops are atypical with respect to their payloads.
-
-  - readdir ops use the smaller of the two pre-allocated pre-partitioned
-    memory buffers. The readdir buffer is only available to userspace.
-    The kernel module obtains an index to a free partition before launching
-    a readdir op. Userspace deposits the results into the indexed partition
-    and then writes them to back to the pvfs device.
-
-  - io (read and write) ops use the larger of the two pre-allocated
-    pre-partitioned memory buffers. The IO buffer is accessible from
-    both userspace and the kernel module. The kernel module obtains an
-    index to a free partition before launching an io op. The kernel module
-    deposits write data into the indexed partition, to be consumed
-    directly by userspace. Userspace deposits the results of read
-    requests into the indexed partition, to be consumed directly
-    by the kernel module.
-
-Responses to kernel requests are all packaged in pvfs2_downcall_t
-structs. Besides a few other members, pvfs2_downcall_t contains a
-union of structs, each of which is associated with a particular
-response type.
-
-The several members outside of the union are:
- - int32_t type - type of operation.
- - int32_t status - return code for the operation.
- - int64_t trailer_size - 0 unless readdir operation.
- - char *trailer_buf - initialized to NULL, used during readdir operations.
-
-The appropriate member inside the union is filled out for any
-particular response.
-
-  PVFS2_VFS_OP_FILE_IO
-    fill a pvfs2_io_response_t
-
-  PVFS2_VFS_OP_LOOKUP
-    fill a PVFS_object_kref
-
-  PVFS2_VFS_OP_CREATE
-    fill a PVFS_object_kref
-
-  PVFS2_VFS_OP_SYMLINK
-    fill a PVFS_object_kref
-
-  PVFS2_VFS_OP_GETATTR
-    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
-    fill in a string with the link target when the object is a symlink.
-
-  PVFS2_VFS_OP_MKDIR
-    fill a PVFS_object_kref
-
-  PVFS2_VFS_OP_STATFS
-    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
-    us to know, in a timely fashion, these statistics about our
-    distributed network filesystem.
-
-  PVFS2_VFS_OP_FS_MOUNT
-    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
-    except its members are in a different order and "__pad1" is replaced
-    with "id".
-
-  PVFS2_VFS_OP_GETXATTR
-    fill a pvfs2_getxattr_response_t
-
-  PVFS2_VFS_OP_LISTXATTR
-    fill a pvfs2_listxattr_response_t
-
-  PVFS2_VFS_OP_PARAM
-    fill a pvfs2_param_response_t
-
-  PVFS2_VFS_OP_PERF_COUNT
-    fill a pvfs2_perf_count_response_t
-
-  PVFS2_VFS_OP_FSKEY
-    file a pvfs2_fs_key_response_t
-
-  PVFS2_VFS_OP_READDIR
-    jamb everything needed to represent a pvfs2_readdir_response_t into
-    the readdir buffer descriptor specified in the upcall.
-
-Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
-made by the kernel side.
-
-A buffer_list containing:
-  - a pointer to the prepared response to the request from the
-    kernel (struct pvfs2_downcall_t).
-  - and also, in the case of a readdir request, a pointer to a
-    buffer containing descriptors for the objects in the target
-    directory.
-... is sent to the function (PINT_dev_write_list) which performs
-the writev.
-
-PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
-
-The first four elements of io_array are initialized like this for all
-responses:
-
-  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
-  io_array[0].iov_len = sizeof(int32_t)
-
-  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
-  io_array[1].iov_len = sizeof(int32_t)
-
-  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
-  io_array[2].iov_len = sizeof(int64_t)
-
-  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
-                         of global variable vfs_request (vfs_request_t)
-  io_array[3].iov_len = sizeof(pvfs2_downcall_t)
-
-Readdir responses initialize the fifth element io_array like this:
-
-  io_array[4].iov_base = contents of member trailer_buf (char *)
-                         from out_downcall member of global variable
-                         vfs_request
-  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
-                        from out_downcall member of global variable
-                        vfs_request
-
-Orangefs exploits the dcache in order to avoid sending redundant
-requests to userspace. We keep object inode attributes up-to-date with
-orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
-help it decide whether or not to update an inode: "new" and "bypass".
-Orangefs keeps private data in an object's inode that includes a short
-timeout value, getattr_time, which allows any iteration of
-orangefs_inode_getattr to know how long it has been since the inode was
-updated. When the object is not new (new == 0) and the bypass flag is not
-set (bypass == 0) orangefs_inode_getattr returns without updating the inode
-if getattr_time has not timed out. Getattr_time is updated each time the
-inode is updated.
-
-Creation of a new object (file, dir, sym-link) includes the evaluation of
-its pathname, resulting in a negative directory entry for the object.
-A new inode is allocated and associated with the dentry, turning it from
-a negative dentry into a "productive full member of society". Orangefs
-obtains the new inode from Linux with new_inode() and associates
-the inode with the dentry by sending the pair back to Linux with
-d_instantiate().
-
-The evaluation of a pathname for an object resolves to its corresponding
-dentry. If there is no corresponding dentry, one is created for it in
-the dcache. Whenever a dentry is modified or verified Orangefs stores a
-short timeout value in the dentry's d_time, and the dentry will be trusted
-for that amount of time. Orangefs is a network filesystem, and objects
-can potentially change out-of-band with any particular Orangefs kernel module
-instance, so trusting a dentry is risky. The alternative to trusting
-dentries is to always obtain the needed information from userspace - at
-least a trip to the client-core, maybe to the servers. Obtaining information
-from a dentry is cheap, obtaining it from userspace is relatively expensive,
-hence the motivation to use the dentry when possible.
-
-The timeout values d_time and getattr_time are jiffy based, and the
-code is designed to avoid the jiffy-wrap problem:
-
-"In general, if the clock may have wrapped around more than once, there
-is no way to tell how much time has elapsed. However, if the times t1
-and t2 are known to be fairly close, we can reliably compute the
-difference in a way that takes into account the possibility that the
-clock may have wrapped between times."
-
-                      from course notes by instructor Andy Wang
-
author	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
	Mon, 17 Feb 2020 16:12:17 +0000 (17:12 +0100)
committer	Jonathan Corbet <corbet@lwn.net>
	Mon, 2 Mar 2020 21:04:40 +0000 (14:04 -0700)
Documentation/filesystems/index.rst		patch \| blob \| history
Documentation/filesystems/orangefs.rst	[new file with mode: 0644]	patch \| blob
Documentation/filesystems/orangefs.txt	[deleted file]	patch \| blob \| history