Documentation/filesystems/ext4/journal.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 Journal (jbd2)
   4 --------------
   5
The ext4 filesystem employs a journal, a feature first introduced in ext3, to
protect the filesystem against metadata inconsistencies after a system crash. Up
   8 to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
   9 size limits) can be reserved inside the filesystem as a place to land
  10 “important” data writes on-disk as quickly as possible. Once the important
  11 data transaction is fully written to the disk and flushed from the disk write
  12 cache, a record of the data being committed is also written to the journal. At
  13 some later point in time, the journal code writes the transactions to their
  14 final locations on disk (this could involve a lot of seeking or a lot of small
  15 read-write-erases) before erasing the commit record. Should the system
  16 crash during the second slow write, the journal can be replayed all the
  17 way to the latest commit record, guaranteeing the atomicity of whatever
  18 gets written through the journal to the disk. The effect of this is to
  19 guarantee that the filesystem does not become stuck midway through a
  20 metadata update.
  21
For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.
  30
In ``data=ordered`` mode, ext4 also supports fast commits, which
reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 stores only the minimal delta needed to recreate the
affected metadata in a fast commit area that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer expires, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.
  41
The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.
  47
  48 All fields in jbd2 are written to disk in big-endian order. This is the
  49 opposite of ext4.
  50
  51 NOTE: Both ext4 and ocfs2 use jbd2.
  52
  53 The maximum size of a journal embedded in an ext4 filesystem is 2^32
  54 blocks. jbd2 itself does not seem to care.
  55
  56 Layout
  57 ~~~~~~
  58
  59 Generally speaking, the journal has this format:
  60
  61 .. list-table::
  62    :widths: 16 48 16
  63    :header-rows: 1
  64
  65    * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
  68      - [more transactions...]
  69    * -
  70      - One transaction
  71      -
  72
  73 Notice that a transaction begins with either a descriptor and some data,
  74 or a block revocation list. A finished transaction always ends with a
  75 commit. If there is no commit record (or the checksums don't match), the
  76 transaction will be discarded during replay.
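
The replay rule above can be sketched as a small model. This is an
illustration only, not the kernel's recovery code: it assumes the log has
already been decoded into hypothetical ``(sequence, blocktype)`` pairs, and it
leaves out data blocks and checksum verification.

.. code-block:: python

   # Block type codes used by jbd2 for descriptor, commit and revoke blocks.
   DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5

   def committed_transactions(blocks):
       """Return the sequence numbers of transactions that end in a commit.

       `blocks` is a hypothetical list of (sequence, blocktype) pairs in
       log order. A transaction that never reaches its commit record is
       dropped, which mirrors what replay does.
       """
       committed, open_seq = [], None
       for sequence, blocktype in blocks:
           if blocktype in (DESCRIPTOR, REVOKE):
               open_seq = sequence          # a transaction is in progress
           elif blocktype == COMMIT and sequence == open_seq:
               committed.append(sequence)   # transaction is complete
               open_seq = None
       return committed

   log = [(7, DESCRIPTOR), (7, COMMIT), (8, REVOKE), (8, COMMIT), (9, DESCRIPTOR)]
   replayable = committed_transactions(log)  # transaction 9 has no commit
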
  77
  78 External Journal
  79 ~~~~~~~~~~~~~~~~
  80
  81 Optionally, an ext4 filesystem can be created with an external journal
  82 device (as opposed to an internal journal, which uses a reserved inode).
  83 In this case, on the filesystem device, ``s_journal_inum`` should be
  84 zero and ``s_journal_uuid`` should be set. On the journal device there
  85 will be an ext4 super block in the usual place, with a matching UUID.
  86 The journal superblock will be in the next full block after the
  87 superblock.
  88
  89 .. list-table::
  90    :widths: 12 12 12 32 12
  91    :header-rows: 1
  92
  93    * - 1024 bytes of padding
  94      - ext4 Superblock
  95      - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
  98      - [more transactions...]
  99    * -
 100      -
 101      -
 102      - One transaction
 103      -
 104
 105 Block Header
 106 ~~~~~~~~~~~~
 107
 108 Every block in the journal starts with a common 12-byte header
 109 ``struct journal_header_s``:
 110
 111 .. list-table::
 112    :widths: 8 8 24 40
 113    :header-rows: 1
 114
 115    * - Offset
 116      - Type
 117      - Name
 118      - Description
 119    * - 0x0
 120      - __be32
 121      - h_magic
 122      - jbd2 magic number, 0xC03B3998.
 123    * - 0x4
 124      - __be32
 125      - h_blocktype
 126      - Description of what this block contains. See the jbd2_blocktype_ table
 127        below.
 128    * - 0x8
 129      - __be32
 130      - h_sequence
 131      - The transaction ID that goes with this block.
 132
 133 .. _jbd2_blocktype:
 134
 135 The journal block type can be any one of:
 136
 137 .. list-table::
 138    :widths: 16 64
 139    :header-rows: 1
 140
 141    * - Value
 142      - Description
 143    * - 1
 144      - Descriptor. This block precedes a series of data blocks that were
 145        written through the journal during a transaction.
 146    * - 2
 147      - Block commit record. This block signifies the completion of a
 148        transaction.
 149    * - 3
 150      - Journal superblock, v1.
 151    * - 4
 152      - Journal superblock, v2.
 153    * - 5
 154      - Block revocation records. This speeds up recovery by enabling the
 155        journal to skip writing blocks that were subsequently rewritten.
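
As a rough illustration of the header layout and block types above, the
following sketch decodes the 12-byte header with Python's ``struct`` module.
The function name and the returned tuple are choices made for this example;
only the offsets, the magic number and the type codes come from the tables.

.. code-block:: python

   import struct

   JBD2_MAGIC = 0xC03B3998
   BLOCK_TYPES = {
       1: "descriptor",
       2: "commit",
       3: "superblock v1",
       4: "superblock v2",
       5: "revoke",
   }

   def parse_journal_header(block):
       """Decode h_magic, h_blocktype and h_sequence (all big-endian).

       Returns (type_name, sequence), or None if the magic number does
       not match, for example for a plain data block.
       """
       magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
       if magic != JBD2_MAGIC:
           return None
       return BLOCK_TYPES.get(blocktype, "unknown"), sequence
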
 156
 157 Super Block
 158 ~~~~~~~~~~~
 159
The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.
 163
 164 The journal superblock is recorded as ``struct journal_superblock_s``,
 165 which is 1024 bytes long:
 166
 167 .. list-table::
 168    :widths: 8 8 24 40
 169    :header-rows: 1
 170
 171    * - Offset
 172      - Type
 173      - Name
 174      - Description
 175    * -
 176      -
 177      -
 178      - Static information describing the journal.
 179    * - 0x0
 180      - journal_header_t (12 bytes)
 181      - s_header
 182      - Common header identifying this as a superblock.
 183    * - 0xC
 184      - __be32
 185      - s_blocksize
 186      - Journal device block size.
 187    * - 0x10
 188      - __be32
 189      - s_maxlen
 190      - Total number of blocks in this journal.
 191    * - 0x14
 192      - __be32
 193      - s_first
 194      - First block of log information.
 195    * -
 196      -
 197      -
 198      - Dynamic information describing the current state of the log.
 199    * - 0x18
 200      - __be32
 201      - s_sequence
 202      - First commit ID expected in log.
 203    * - 0x1C
 204      - __be32
 205      - s_start
 206      - Block number of the start of log. Contrary to the comments, this field
 207        being zero does not imply that the journal is clean!
 208    * - 0x20
 209      - __be32
 210      - s_errno
 211      - Error value, as set by jbd2_journal_abort().
 212    * -
 213      -
 214      -
 215      - The remaining fields are only valid in a v2 superblock.
 216    * - 0x24
 217      - __be32
     - s_feature_compat
 219      - Compatible feature set. See the table jbd2_compat_ below.
 220    * - 0x28
 221      - __be32
 222      - s_feature_incompat
 223      - Incompatible feature set. See the table jbd2_incompat_ below.
 224    * - 0x2C
 225      - __be32
 226      - s_feature_ro_compat
 227      - Read-only compatible feature set. There aren't any of these currently.
 228    * - 0x30
 229      - __u8
 230      - s_uuid[16]
 231      - 128-bit uuid for journal. This is compared against the copy in the ext4
 232        super block at mount time.
 233    * - 0x40
 234      - __be32
 235      - s_nr_users
 236      - Number of file systems sharing this journal.
 237    * - 0x44
 238      - __be32
 239      - s_dynsuper
 240      - Location of dynamic super block copy. (Not used?)
 241    * - 0x48
 242      - __be32
 243      - s_max_transaction
 244      - Limit of journal blocks per transaction. (Not used?)
 245    * - 0x4C
 246      - __be32
 247      - s_max_trans_data
 248      - Limit of data blocks per transaction. (Not used?)
 249    * - 0x50
 250      - __u8
 251      - s_checksum_type
 252      - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
 253        more info.
 254    * - 0x51
 255      - __u8[3]
 256      - s_padding2
 257      -
 258    * - 0x54
 259      - __be32
 260      - s_num_fc_blocks
 261      - Number of fast commit blocks in the journal.
 262    * - 0x58
 263      - __u32
     - s_padding[41]
 265      -
 266    * - 0xFC
 267      - __be32
 268      - s_checksum
 269      - Checksum of the entire superblock, with this field set to zero.
 270    * - 0x100
 271      - __u8
 272      - s_users[16*48]
 273      - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
 274        shared external journals, but I imagine Lustre (or ocfs2?), which use
 275        the jbd2 code, might.
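
A minimal sketch of reading a few of these fields, assuming the caller has
already obtained the 1024 superblock bytes (how they are read from the journal
inode or an external device is outside this example). The function name and
the choice of fields are illustrative; the offsets are taken from the table.

.. code-block:: python

   import struct

   def parse_journal_superblock(sb):
       """Decode selected fields of a jbd2 superblock (big-endian)."""
       if len(sb) < 1024:
           raise ValueError("journal superblock is 1024 bytes long")
       _magic, blocktype, _seq = struct.unpack_from(">III", sb, 0x0)
       blocksize, maxlen, first = struct.unpack_from(">III", sb, 0xC)
       sequence, start, errno = struct.unpack_from(">III", sb, 0x18)
       fields = {
           "s_blocksize": blocksize,
           "s_maxlen": maxlen,
           "s_first": first,
           "s_sequence": sequence,
           "s_start": start,
           "s_errno": errno,
       }
       if blocktype == 4:  # the remaining fields are only valid in v2
           compat, incompat, ro_compat = struct.unpack_from(">III", sb, 0x24)
           (nr_users,) = struct.unpack_from(">I", sb, 0x40)
           (num_fc_blocks,) = struct.unpack_from(">I", sb, 0x54)
           fields.update({
               "s_feature_compat": compat,
               "s_feature_incompat": incompat,
               "s_feature_ro_compat": ro_compat,
               "s_uuid": bytes(sb[0x30:0x40]),
               "s_nr_users": nr_users,
               "s_checksum_type": sb[0x50],
               "s_num_fc_blocks": num_fc_blocks,
           })
       return fields
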
 276
 277 .. _jbd2_compat:
 278
 279 The journal compat features are any combination of the following:
 280
 281 .. list-table::
 282    :widths: 16 64
 283    :header-rows: 1
 284
 285    * - Value
 286      - Description
 287    * - 0x1
 288      - Journal maintains checksums on the data blocks.
 289        (JBD2_FEATURE_COMPAT_CHECKSUM)
 290
 291 .. _jbd2_incompat:
 292
 293 The journal incompat features are any combination of the following:
 294
 295 .. list-table::
 296    :widths: 16 64
 297    :header-rows: 1
 298
 299    * - Value
 300      - Description
 301    * - 0x1
 302      - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
 303    * - 0x2
 304      - Journal can deal with 64-bit block numbers.
 305        (JBD2_FEATURE_INCOMPAT_64BIT)
 306    * - 0x4
 307      - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
 308    * - 0x8
 309      - This journal uses v2 of the checksum on-disk format. Each journal
 310        metadata block gets its own checksum, and the block tags in the
 311        descriptor table contain checksums for each of the data blocks in the
 312        journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
 313    * - 0x10
 314      - This journal uses v3 of the checksum on-disk format. This is the same as
 315        v2, but the journal block tag size is fixed regardless of the size of
 316        block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
 317    * - 0x20
 318      - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
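
The incompat bits can be decoded with a simple mask test. The sketch below
uses shortened names (the ``JBD2_FEATURE_INCOMPAT_`` prefix is dropped) and
also reports any bits that are not listed in the table, since an
implementation that does not understand an incompat bit should not touch the
journal.

.. code-block:: python

   INCOMPAT_FEATURES = {
       0x1: "REVOKE",
       0x2: "64BIT",
       0x4: "ASYNC_COMMIT",
       0x8: "CSUM_V2",
       0x10: "CSUM_V3",
       0x20: "FAST_COMMIT",
   }

   def decode_incompat(value):
       """Return (list of known feature names, mask of unknown bits)."""
       known_mask = sum(INCOMPAT_FEATURES)  # sums the dict keys: 0x3F
       names = [name for bit, name in INCOMPAT_FEATURES.items() if value & bit]
       return names, value & ~known_mask
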
 319
 320 .. _jbd2_checksum_type:
 321
 322 Journal checksum type codes are one of the following.  crc32 or crc32c are the
 323 most likely choices.
 324
 325 .. list-table::
 326    :widths: 16 64
 327    :header-rows: 1
 328
 329    * - Value
 330      - Description
 331    * - 1
 332      - CRC32
 333    * - 2
 334      - MD5
 335    * - 3
 336      - SHA1
 337    * - 4
 338      - CRC32C
 339
 340 Descriptor Block
 341 ~~~~~~~~~~~~~~~~
 342
 343 The descriptor block contains an array of journal block tags that
 344 describe the final locations of the data blocks that follow in the
 345 journal. Descriptor blocks are open-coded instead of being completely
 346 described by a data structure, but here is the block structure anyway.
 347 Descriptor blocks consume at least 36 bytes, but use a full block:
 348
 349 .. list-table::
 350    :widths: 8 8 24 40
 351    :header-rows: 1
 352
 353    * - Offset
 354      - Type
 355      - Name
     - Description
 357    * - 0x0
 358      - journal_header_t
 359      - (open coded)
 360      - Common block header.
 361    * - 0xC
 362      - struct journal_block_tag_s
 363      - open coded array[]
 364      - Enough tags either to fill up the block or to describe all the data
 365        blocks that follow this descriptor block.
 366
 367 Journal block tags have any of the following formats, depending on which
 368 journal feature and block tag flags are set.
 369
 370 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
 371 defined as ``struct journal_block_tag3_s``, which looks like the
 372 following. The size is 16 or 32 bytes.
 373
 374 .. list-table::
 375    :widths: 8 8 24 40
 376    :header-rows: 1
 377
 378    * - Offset
 379      - Type
 380      - Name
     - Description
 382    * - 0x0
 383      - __be32
 384      - t_blocknr
 385      - Lower 32-bits of the location of where the corresponding data block
 386        should end up on disk.
 387    * - 0x4
 388      - __be32
 389      - t_flags
 390      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
 391        more info.
 392    * - 0x8
 393      - __be32
 394      - t_blocknr_high
 395      - Upper 32-bits of the location of where the corresponding data block
 396        should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
 397        not enabled.
 398    * - 0xC
 399      - __be32
 400      - t_checksum
 401      - Checksum of the journal UUID, the sequence number, and the data block.
 402    * -
 403      -
 404      -
 405      - This field appears to be open coded. It always comes at the end of the
 406        tag, after t_checksum. This field is not present if the "same UUID" flag
 407        is set.
   * - 0x10
 409      - char
 410      - uuid[16]
 411      - A UUID to go with this tag. This field appears to be copied from the
 412        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
 413        field.
 414
 415 .. _jbd2_tag_flags:
 416
 417 The journal tag flags are any combination of the following:
 418
 419 .. list-table::
 420    :widths: 16 64
 421    :header-rows: 1
 422
 423    * - Value
 424      - Description
 425    * - 0x1
 426      - On-disk block is escaped. The first four bytes of the data block just
 427        happened to match the jbd2 magic number.
 428    * - 0x2
 429      - This block has the same UUID as previous, therefore the UUID field is
 430        omitted.
 431    * - 0x4
 432      - The data block was deleted by the transaction. (Not used?)
 433    * - 0x8
 434      - This is the last tag in this descriptor block.
 435
 436 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
 437 is defined as ``struct journal_block_tag_s``, which looks like the
 438 following. The size is 8, 12, 24, or 28 bytes:
 439
 440 .. list-table::
 441    :widths: 8 8 24 40
 442    :header-rows: 1
 443
 444    * - Offset
 445      - Type
 446      - Name
     - Description
 448    * - 0x0
 449      - __be32
 450      - t_blocknr
 451      - Lower 32-bits of the location of where the corresponding data block
 452        should end up on disk.
 453    * - 0x4
 454      - __be16
 455      - t_checksum
 456      - Checksum of the journal UUID, the sequence number, and the data block.
 457        Note that only the lower 16 bits are stored.
 458    * - 0x6
 459      - __be16
 460      - t_flags
 461      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
 462        more info.
 463    * -
 464      -
 465      -
 466      - This next field is only present if the super block indicates support for
 467        64-bit block numbers.
 468    * - 0x8
 469      - __be32
 470      - t_blocknr_high
 471      - Upper 32-bits of the location of where the corresponding data block
 472        should end up on disk.
 473    * -
 474      -
 475      -
 476      - This field appears to be open coded. It always comes at the end of the
 477        tag, after t_flags or t_blocknr_high. This field is not present if the
 478        "same UUID" flag is set.
 479    * - 0x8 or 0xC
 480      - char
 481      - uuid[16]
 482      - A UUID to go with this tag. This field appears to be copied from the
 483        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
 484        field.
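
The tag sizes quoted above (16 or 32 bytes with CSUM_V3; 8, 12, 24, or 28 bytes
without) follow from which optional fields are present. A small sketch of that
arithmetic, with a hypothetical helper name:

.. code-block:: python

   # Tag flag values from the journal tag flags table.
   FLAG_ESCAPE = 0x1
   FLAG_SAME_UUID = 0x2
   FLAG_DELETED = 0x4
   FLAG_LAST_TAG = 0x8

   def tag_size(flags, csum_v3, has_64bit):
       """Return the on-disk size in bytes of one journal block tag."""
       if csum_v3:
           size = 16                        # fixed layout, t_blocknr_high always present
       else:
           size = 12 if has_64bit else 8    # t_blocknr_high only with 64-bit support
       if not flags & FLAG_SAME_UUID:
           size += 16                       # trailing uuid[16]
       return size

A reader walking a descriptor block would advance by ``tag_size()`` after each
tag and stop once it has processed a tag with ``FLAG_LAST_TAG`` set.
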
 485
 486 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
 487 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
 488 ``struct jbd2_journal_block_tail``, which looks like this:
 489
 490 .. list-table::
 491    :widths: 8 8 24 40
 492    :header-rows: 1
 493
 494    * - Offset
 495      - Type
 496      - Name
     - Description
 498    * - 0x0
 499      - __be32
 500      - t_checksum
 501      - Checksum of the journal UUID + the descriptor block, with this field set
 502        to zero.
 503
 504 Data Block
 505 ~~~~~~~~~~
 506
 507 In general, the data blocks being written to disk through the journal
 508 are written verbatim into the journal file after the descriptor block.
 509 However, if the first four bytes of the block match the jbd2 magic
 510 number then those four bytes are replaced with zeroes and the “escaped”
 511 flag is set in the descriptor block tag.
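
The escaping rule can be sketched as a pair of helpers. The names are invented
for this example; the boolean stands in for the "escaped" bit (0x1) in the
block tag flags.

.. code-block:: python

   JBD2_MAGIC_BYTES = (0xC03B3998).to_bytes(4, "big")

   def escape_block(data):
       """Return (bytes to store in the journal, escaped flag)."""
       if data[:4] == JBD2_MAGIC_BYTES:
           return b"\x00\x00\x00\x00" + data[4:], True
       return data, False

   def unescape_block(journaled, escaped):
       """Restore the original block contents during replay."""
       if escaped:
           return JBD2_MAGIC_BYTES + journaled[4:]
       return journaled

Without this step, a data block that happens to begin with the magic number
could be mistaken for a journal metadata block when the log is scanned.
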
 512
 513 Revocation Block
 514 ~~~~~~~~~~~~~~~~
 515
 516 A revocation block is used to prevent replay of a block in an earlier
 517 transaction. This is used to mark blocks that were journalled at one
 518 time but are no longer journalled. Typically this happens if a metadata
 519 block is freed and re-allocated as a file data block; in this case, a
 520 journal replay after the file block was written to disk will cause
 521 corruption.
 522
 523 **NOTE**: This mechanism is NOT used to express “this journal block is
 524 superseded by this other journal block”, as the author (djwong)
 525 mistakenly thought. Any block being added to a transaction will cause
 526 the removal of all existing revocation records for that block.
 527
 528 Revocation blocks are described in
 529 ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
 530 length, but use a full block:
 531
 532 .. list-table::
 533    :widths: 8 8 24 40
 534    :header-rows: 1
 535
 536    * - Offset
 537      - Type
 538      - Name
 539      - Description
 540    * - 0x0
 541      - journal_header_t
 542      - r_header
 543      - Common block header.
 544    * - 0xC
 545      - __be32
 546      - r_count
 547      - Number of bytes used in this block.
 548    * - 0x10
 549      - __be32 or __be64
 550      - blocks[0]
 551      - Blocks to revoke.
 552
 553 After r_count is a linear array of block numbers that are effectively
 554 revoked by this transaction. The size of each block number is 8 bytes if
 555 the superblock advertises 64-bit block number support, or 4 bytes
 556 otherwise.
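
A minimal sketch of walking that array. It reads ``r_count`` as the number of
bytes used in the block counted from the start of the block, so the 16-byte
header is included; that reading of the field is an assumption of this
example, as is the function name.

.. code-block:: python

   import struct

   def parse_revoke_block(block, has_64bit):
       """Return the list of revoked block numbers in one revocation block."""
       (r_count,) = struct.unpack_from(">I", block, 0xC)
       if not 0x10 <= r_count <= len(block):
           raise ValueError("r_count out of range")
       fmt, width = (">Q", 8) if has_64bit else (">I", 4)
       return [
           struct.unpack_from(fmt, block, offset)[0]
           for offset in range(0x10, r_count - width + 1, width)
       ]
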
 557
 558 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
 559 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation
 560 block is a ``struct jbd2_journal_revoke_tail``, which has this format:
 561
 562 .. list-table::
 563    :widths: 8 8 24 40
 564    :header-rows: 1
 565
 566    * - Offset
 567      - Type
 568      - Name
 569      - Description
 570    * - 0x0
 571      - __be32
 572      - r_checksum
 573      - Checksum of the journal UUID + revocation block
 574
 575 Commit Block
 576 ~~~~~~~~~~~~
 577
 578 The commit block is a sentry that indicates that a transaction has been
 579 completely written to the journal. Once this commit block reaches the
 580 journal, the data stored with this transaction can be written to their
 581 final locations on disk.
 582
The commit block is described by ``struct commit_header``, which is 60
bytes long according to the offsets below (but uses a full block):
 585
 586 .. list-table::
 587    :widths: 8 8 24 40
 588    :header-rows: 1
 589
 590    * - Offset
 591      - Type
 592      - Name
     - Description
 594    * - 0x0
 595      - journal_header_s
 596      - (open coded)
 597      - Common block header.
 598    * - 0xC
 599      - unsigned char
 600      - h_chksum_type
 601      - The type of checksum to use to verify the integrity of the data blocks
 602        in the transaction. See jbd2_checksum_type_ for more info.
 603    * - 0xD
 604      - unsigned char
 605      - h_chksum_size
 606      - The number of bytes used by the checksum. Most likely 4.
 607    * - 0xE
 608      - unsigned char
 609      - h_padding[2]
 610      -
 611    * - 0x10
 612      - __be32
 613      - h_chksum[JBD2_CHECKSUM_BYTES]
 614      - 32 bytes of space to store checksums. If
 615        JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
 616        are set, the first ``__be32`` is the checksum of the journal UUID and
 617        the entire commit block, with this field zeroed. If
 618        JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
 619        crc32 of all the blocks already written to the transaction.
 620    * - 0x30
 621      - __be64
 622      - h_commit_sec
 623      - The time that the transaction was committed, in seconds since the epoch.
 624    * - 0x38
 625      - __be32
 626      - h_commit_nsec
 627      - Nanoseconds component of the above timestamp.
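
The commit block fields can be decoded in the same way as the other jbd2
structures. The sketch below only unpacks the fields at the offsets given in
the table; it does not verify any checksum, and the function name is
hypothetical.

.. code-block:: python

   import struct

   def parse_commit_block(block):
       """Decode the commit header fields (all big-endian)."""
       _magic, _blocktype, sequence = struct.unpack_from(">III", block, 0x0)
       chksum_type, chksum_size = struct.unpack_from(">BB", block, 0xC)
       chksum = struct.unpack_from(">8I", block, 0x10)    # 32 bytes of checksum space
       sec, nsec = struct.unpack_from(">QI", block, 0x30)
       return {
           "h_sequence": sequence,
           "h_chksum_type": chksum_type,
           "h_chksum_size": chksum_size,
           "h_chksum": chksum,
           "h_commit_sec": sec,
           "h_commit_nsec": nsec,
       }
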
 628
 629 Fast commits
 630 ~~~~~~~~~~~~
 631
The fast commit area is organized as a log of tag-length-value (TLV) entries.
Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the length
of the entire field. It is followed by a variable-length, tag-specific value.
Here is the list of supported tags and their meanings:
 636
 637 .. list-table::
 638    :widths: 8 20 20 32
 639    :header-rows: 1
 640
 641    * - Tag
 642      - Meaning
 643      - Value struct
 644      - Description
 645    * - EXT4_FC_TAG_HEAD
 646      - Fast commit area header
 647      - ``struct ext4_fc_head``
 648      - Stores the TID of the transaction after which these fast commits should
 649        be applied.
 650    * - EXT4_FC_TAG_ADD_RANGE
 651      - Add extent to inode
 652      - ``struct ext4_fc_add_range``
 653      - Stores the inode number and extent to be added in this inode
 654    * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
 656      - ``struct ext4_fc_del_range``
 657      - Stores the inode number and the logical offset range that needs to be
 658        removed
 659    * - EXT4_FC_TAG_CREAT
 660      - Create directory entry for a newly created file
 661      - ``struct ext4_fc_dentry_info``
 662      - Stores the parent inode number, inode number and directory entry of the
 663        newly created file
 664    * - EXT4_FC_TAG_LINK
 665      - Link a directory entry to an inode
 666      - ``struct ext4_fc_dentry_info``
 667      - Stores the parent inode number, inode number and directory entry
 668    * - EXT4_FC_TAG_UNLINK
 669      - Unlink a directory entry of an inode
 670      - ``struct ext4_fc_dentry_info``
 671      - Stores the parent inode number, inode number and directory entry
 672
 673    * - EXT4_FC_TAG_PAD
 674      - Padding (unused area)
 675      - None
 676      - Unused bytes in the fast commit area.
 677
 678    * - EXT4_FC_TAG_TAIL
 679      - Mark the end of a fast commit
 680      - ``struct ext4_fc_tail``
 681      - Stores the TID of the commit, CRC of the fast commit of which this tag
 682        represents the end of
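
Walking such a log is a loop over TLV headers. The sketch below is not the
kernel's parser: it assumes that ``struct ext4_fc_tl`` consists of a 16-bit
little-endian tag followed by a 16-bit little-endian length, and that the
length counts only the value bytes. Both assumptions should be checked against
``fs/ext4/fast_commit.h``. Tags are returned as plain integers.

.. code-block:: python

   import struct

   TL_HEADER = "<HH"                        # assumed layout: tag, then length
   TL_HEADER_SIZE = struct.calcsize(TL_HEADER)

   def walk_tlvs(area):
       """Return a list of (tag, value bytes) pairs from a fast commit area."""
       entries, offset = [], 0
       while offset + TL_HEADER_SIZE <= len(area):
           tag, length = struct.unpack_from(TL_HEADER, area, offset)
           offset += TL_HEADER_SIZE
           if offset + length > len(area):
               break                        # truncated entry, stop walking
           entries.append((tag, area[offset:offset + length]))
           offset += length
       return entries
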
 683
 684 Fast Commit Replay Idempotence
 685 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 686
Fast commit tags are idempotent provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.
 691
 692 Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
 693 was associated with inode 10. During fast commit, instead of storing this
 694 operation as a procedure "rename a to b", we store the resulting file system
 695 state as a "series" of outcomes:
 696
 697 - Link dirent b to inode 10
 698 - Unlink dirent a
 699 - Inode 10 with valid refcount
 700
Now when the recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.
 703
Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:
 706
 707 1) rm A
 708 2) mv B A
 709 3) read A
 710
 711 If we store this sequence of operations as is then the replay is not idempotent.
 712 Let's say while in replay, we crash after (2). During the second replay,
 713 file A (which was actually created as a result of "mv B A" operation) would get
 714 deleted. Thus, file named A would be absent when we try to read A. So, this
 715 sequence of operations is not idempotent. However, as mentioned above, instead
 716 of storing the procedure fast commits store the outcome of each procedure. Thus
 717 the fast commit log for above procedure would be as follows:
 718
 719 (Let's assume dirent A was linked to inode 10 and dirent B was linked to
 720 inode 11 before the replay)
 721
 722 1) Unlink A
 723 2) Link A to inode 11
 724 3) Unlink B
 725 4) Inode 11
 726
 727 If we crash after (3) we will have file A linked to inode 11. During the second
 728 replay, we will remove file A (inode 11). But we will create it back and make
 729 it point to inode 11. We won't find B, so we'll just skip that step. At this
 730 point, the refcount for inode 11 is not reliable, but that gets fixed by the
 731 replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
 732 into a series of idempotent outcomes, fast commits ensured idempotence during
 733 the replay.
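
The argument above can be checked with a toy model. The dictionaries below
stand in for the directory and the inode table, and the tuple-based log format
is invented for this sketch; real fast commit replay works on the TLV tags
described earlier.

.. code-block:: python

   def initial_state():
       # dirent A -> inode 10 and dirent B -> inode 11, as in the example above
       return {"dirents": {"A": 10, "B": 11}, "nlink": {11: 1}}

   FC_LOG = [
       ("unlink", "A"),
       ("link", "A", 11),
       ("unlink", "B"),
       ("inode", 11, 1),        # final state of inode 11, including its refcount
   ]

   def replay(state, log):
       """Enforce each recorded outcome on the state."""
       for entry in log:
           kind = entry[0]
           if kind == "unlink":
               state["dirents"].pop(entry[1], None)    # a missing dirent is skipped
           elif kind == "link":
               state["dirents"][entry[1]] = entry[2]
           elif kind == "inode":
               state["nlink"][entry[1]] = entry[2]     # overwrite, never increment
       return state

   clean = replay(initial_state(), FC_LOG)
   interrupted = replay(initial_state(), FC_LOG[:3])   # crash after step (3)
   recovered = replay(interrupted, FC_LOG)             # second, complete replay

Because every entry sets a value rather than adjusting one, ``recovered`` ends
up equal to ``clean``.
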
 734
 735 Journal Checkpoint
~~~~~~~~~~~~~~~~~~
 737
 738 Checkpointing the journal ensures all transactions and their associated buffers
 739 are submitted to the disk. In-progress transactions are waited upon and included
 740 in the checkpoint. Checkpointing is used internally during critical updates to
 741 the filesystem including journal recovery, filesystem resizing, and freeing of
 742 the journal_t structure.
 743
 744 A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
 748 invalid input, otherwise it returns success without performing
 749 any checkpointing. This can be used to check whether the ioctl exists on a
 750 system and to verify there are no issues with arguments or flags. The
 751 other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
 752 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
 753 discarded or zero-filled, respectively, after the journal checkpoint is
 754 complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
 755 cannot both be set. The ioctl may be useful when snapshotting a system or for
 756 complying with content deletion SLOs.
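
The flag rules above can be expressed as a small validity check. This sketch
does not issue the ioctl. The numeric flag values are assumptions made for the
example and should be taken from the kernel's ext4 UAPI header rather than
from here.

.. code-block:: python

   # Assumed values for illustration; verify against the kernel headers.
   CHECKPOINT_FLAG_DISCARD = 0x1
   CHECKPOINT_FLAG_ZEROOUT = 0x2
   CHECKPOINT_FLAG_DRY_RUN = 0x4
   CHECKPOINT_FLAG_VALID = (CHECKPOINT_FLAG_DISCARD
                            | CHECKPOINT_FLAG_ZEROOUT
                            | CHECKPOINT_FLAG_DRY_RUN)

   def checkpoint_flags_ok(flags):
       """Return True if `flags` is a combination the text above allows."""
       if flags & ~CHECKPOINT_FLAG_VALID:
           return False                     # unknown flag bits
       both = CHECKPOINT_FLAG_DISCARD | CHECKPOINT_FLAG_ZEROOUT
       if flags & both == both:
           return False                     # discard and zeroout are exclusive
       return True

A caller could pass the dry-run flag together with the intended flags first,
which asks the kernel to perform this kind of validation without doing any
checkpointing.
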