jbd2: fix potential data lost in recovering journal raced with synchronizing fs bdev
authorZhihao Cheng <chengzhihao1@huawei.com>
Tue, 19 Sep 2023 01:25:25 +0000 (09:25 +0800)
committerGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Tue, 28 Nov 2023 17:20:04 +0000 (17:20 +0000)
commit 61187fce8600e8ef90e601be84f9d0f3222c1206 upstream.

JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
however, other process could intercept the EIO information from bdev's
mapping, which leads journal recovering successful even EIO occurs during
data written back to fs device.

We found this problem in our product, iscsi + multipath is chosen for block
device of ext4. Unstable network may trigger kpartx to rescan partitions in
device mapper layer. Detailed process is shown as following:

  mount          kpartx          irq
jbd2_journal_recover
 do_one_pass
  memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
  mark_buffer_dirty // mark bh dirty
         vfs_read
  generic_file_read_iter // dio
   filemap_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      block_write_full_folio
       submit_bh_wbc
            >>  EIO occurs in disk  <<
                     end_buffer_async_write
      mark_buffer_write_io_error
       mapping_set_error
        set_bit(AS_EIO, &mapping->flags) // set!
    filemap_check_errors
     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
 err2 = sync_blockdev
  filemap_write_and_wait
   filemap_check_errors
    test_and_clear_bit(AS_EIO, &mapping->flags) // false
 err2 = 0

Filesystem is mounted successfully even data from journal is failed written
into disk, and ext4/ocfs2 could become corrupted.

Fix it by comparing the wb_err state in fs block device before recovering
and after recovering.

A reproducer can be found in the kernel bugzilla referenced below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230919012525.1783108-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fs/jbd2/recovery.c

index c269a7d29a46533ca107492d067657debaf7506c..5b771a3d8d9aeb7d561a51a735b89fd75d025e00 100644 (file)
@@ -289,6 +289,8 @@ int jbd2_journal_recover(journal_t *journal)
        journal_superblock_t *  sb;
 
        struct recovery_info    info;
+       errseq_t                wb_err;
+       struct address_space    *mapping;
 
        memset(&info, 0, sizeof(info));
        sb = journal->j_superblock;
@@ -306,6 +308,9 @@ int jbd2_journal_recover(journal_t *journal)
                return 0;
        }
 
+       wb_err = 0;
+       mapping = journal->j_fs_dev->bd_inode->i_mapping;
+       errseq_check_and_advance(&mapping->wb_err, &wb_err);
        err = do_one_pass(journal, &info, PASS_SCAN);
        if (!err)
                err = do_one_pass(journal, &info, PASS_REVOKE);
@@ -327,6 +332,9 @@ int jbd2_journal_recover(journal_t *journal)
 
        jbd2_journal_clear_revoke(journal);
        err2 = sync_blockdev(journal->j_fs_dev);
+       if (!err)
+               err = err2;
+       err2 = errseq_check_and_advance(&mapping->wb_err, &wb_err);
        if (!err)
                err = err2;
        /* Make sure all replayed data is on permanent storage */